Elevator door system failures are a leading cause of elevator malfunctions, impacting safety and operational efficiency. Existing anomaly detection methods often overlook the relative positional relationships among time-series data sequences. This study proposes a novel spatial-temporal information fusion approach to accurately identify abnormal states in elevator door systems.
This paper develops an integrated spatial-temporal fusion framework for anomaly detection. First, operational time-series data are structured into a node graph using a K-Nearest Neighbor (KNN) based method to model complex inter-sequence interactions. Subsequently, a Graph Convolutional Network (GCN) is employed to extract local dependencies and global spatial information, while a Bidirectional Long Short-Term Memory network (BiLSTM) captures temporal evolutionary characteristics. A multi-source information-driven feature fusion mechanism is then designed to enhance model robustness. The proposed method is experimentally validated using real-world elevator door operating data, including vibration and video-derived signals.
The experimental results demonstrate that the proposed method effectively identifies abnormal elevator door states, achieving a high detection accuracy of 96.27%. This confirms the framework's effectiveness and reliability in practical scenarios.
This research presents a novel KNN-based graph construction method that captures dynamic dependencies between time-series sequences based on their relative positions. Furthermore, it develops an integrated framework that concurrently models spatial structural relationships and temporal dynamics, overcoming the limitations of methods that treat these dimensions separately. Finally, it introduces a multi-source feature fusion mechanism that leverages the complementarity of information from different sources and dimensions, significantly enhancing the model's representation capability and robustness under complex operating conditions.
1. Introduction
As a critical subsystem for passenger access, the elevator door system significantly affects both safety and operational efficiency. Statistics indicate that malfunctions in elevator door systems account for over 50% of all elevator failures, constituting a leading cause of passenger entrapments, service disruptions, and even fatal incidents such as shearing injuries and falls (Lan et al., 2021; Wang et al., 2018). Given the structural complexity and variable operating conditions of elevator door systems, performance degradation often manifests as a gradual, latent process that can culminate in sudden failure. Consequently, relying solely on traditional periodic inspections or reactive repairs is insufficient for effective fault prediction and precise performance assessment (An et al., 2021). Therefore, the development of intelligent, real-time anomaly detection methods for elevator door systems is crucial for enhancing reliability, ensuring passenger safety, and optimizing maintenance strategies.
In recent years, elevator condition monitoring and fault diagnosis techniques have made significant progress (Esteban et al., 2016; Wang et al., 2024). Numerous researchers have evaluated and predicted the overall health state of elevators using vibration analysis, operational data mining, artificial intelligence algorithms, and related approaches. For instance, Niu monitored elevator conditions through time-domain and frequency-domain analysis of vibration signals (Niu et al., 2025). Pan proposed an elevator risk assessment method based on an integration of fuzzy comprehensive evaluation and machine learning (Pan et al., 2023). Guo presented a semi-supervised two-stage deep learning network, comprising a feature selector and a classifier, for elevator condition assessment (Guo et al., 2024a). Recent advances in deep learning for fault diagnosis in mechanical and electromechanical systems, such as multi-scale capsule networks (Lu et al., 2025), vision transformers with time-frequency fusion (Xiao et al., 2025a), and novel convolutional attention mechanisms (Wang et al., 2025), have also demonstrated strong feature extraction capabilities. However, their direct application to the specific spatio-temporal dependency problem in elevator door systems remains unexplored. Moreover, research on anomaly detection and performance evaluation for door systems, a high-incidence area for elevator failures that directly affects passenger safety, remains relatively scarce. Existing methods that assess the overall elevator state often struggle to capture the subtle performance degradation characteristics unique to door systems. In addition, a real-time evaluation system capable of assessing the dynamic performance of door systems under complex interaction conditions is notably lacking. Under such conditions, failures often arise from chain reactions and spatio-temporal coupling among multiple component degradations during operation. For example, belt wear may affect door speed, which in turn exacerbates vibration induced by worn sliders. Therefore, research focused on anomaly detection methods for door systems is of significant importance for advancing predictive maintenance and enabling early fault warning in elevators.
Traditional signal processing and shallow learning methods usually rely on handcrafted features and assume fixed temporal or statistical patterns. However, when applied to elevator door systems, such assumptions often become restrictive. Elevator door anomalies frequently exhibit strong nonlinearity, temporal dependency, and structural coupling among signal segments, which are difficult to capture using conventional approaches. Graph convolutional networks provide a natural way to model such non-Euclidean structural relationships, while deep sequential models enable automatic extraction of hierarchical temporal features. As a result, combining graph convolutional networks with deep temporal models provides a suitable framework for modeling elevator door anomalies.
To address the challenge of capturing the performance characteristics of the door system and the lack of a dynamic evaluation framework, this paper proposes a method that integrates temporal and spatial features from multi-source information to enable accurate detection of elevator door anomalies. This research aims to construct a framework capable of simultaneously capturing spatial and temporal dynamic features and achieving high-precision anomaly state identification. The main contributions of this work are summarized as follows:
(1) A KNN-based method is proposed for constructing a graph structure from time-series data. This structure characterizes spatial correlations between sensor nodes while effectively capturing dynamic dependencies among sequences based on their relative positions.
(2) An integrated spatio-temporal information processing framework is developed. Within this framework, graph convolutional networks extract spatial features encompassing both local node dependencies and global spatial topology. Concurrently, bidirectional long short-term memory networks capture long-term evolutionary patterns and short-term fluctuation characteristics in the temporal dimension.
(3) To overcome the limitations of single-source signals, a multi-source feature fusion mechanism is proposed. This mechanism exploits the complementarity of information from different sources and dimensions, thereby improving feature representation, detection accuracy, and reliability under complex operating conditions.
The remainder of the paper is organized as follows: Section 2 introduces related work on spatial-temporal and multi-source information fusion. Section 3 details the proposed methodology for multi-source spatiotemporal fusion. Section 4 describes the data acquisition process and presents experimental validation of the proposed method. Finally, Section 5 concludes the paper.
2. Related work
In recent years, with the spread of industrial Internet of Things (IoT) technology, multi-dimensional data generated during the operation of elevator door systems, such as vibration, sound, and video of door opening and closing, can be collected in real time (Xie et al., 2024; Hsu et al., 2020; Pan et al., 2024). These data contain rich operational state and safety information, reflecting both the dynamic characteristics of the system as it evolves over time and the interrelationships among system states at different locations. Owing to their advantage in modeling non-Euclidean spatial relationships in sensor networks, graph convolutional networks have found widespread application in areas such as traffic flow prediction and social behavior analysis since Kipf and Welling introduced the semi-supervised framework (Kipf and Welling, 2016). Yang combined capsule networks with GCN for harmonic drive compound fault diagnosis (Yang et al., 2023a). Seo proposed a method combining graph autoencoders with graph convolutional networks to model dynamic relationships among subsystems in hydrogen extraction systems, enabling effective anomaly detection and fault diagnosis (Seo et al., 2024). Chen proposed the SA-GCN framework, which integrates structural analysis prior knowledge with measurement data (Chen et al., 2022). Xiao proposed MV-TriGCN, a semi-supervised multi-view learning framework that enhances GCN stability and generalization through an improved triplet loss, diverse view graph construction, and stepwise training (Xiao et al., 2025b). Most of the above studies focus on modeling the spatial structural features of sensors. However, in practice, sensor data often contain both temporal evolution patterns and spatial dependencies, and these two dimensions are distinct yet tightly intertwined. Therefore, a comprehensive consideration of both temporal and spatial dimensions is needed to more fully explore the data's intrinsic structure and potential semantics.
Furthermore, in the field of intelligent operation and maintenance of industrial equipment, the integration of multi-source sensor data for spatial-temporal modeling has emerged as a core technical approach for condition monitoring and performance evaluation (Wu et al., 2024; Han et al., 2025; Lv et al., 2025). Fathizadan proposed a spatio-temporal anomaly detection framework using convolutional LSTM autoencoders and control charts to effectively identify anomalies in additive manufacturing processes (Fathizadan et al., 2024). Sun performed fault diagnosis through temporal node connection modeling (Sun and Yin, 2025). Guo proposed a time–frequency domain interpretation method for CNNs in bearing fault diagnosis, enhancing model transparency through Grad-CAM and gradient-ascent-based kernel visualization (Guo et al., 2024b). Allen proposed the KESA framework, which combines deep learning with domain knowledge for spatiotemporal fault detection and interpretation in complex industrial processes (Allen et al., 2024). Addressing the limitations of multi-sensor long-term sequence modeling in structural health monitoring, Yang utilized a two-layer attention mechanism to enhance LSTM's ability to capture spatial-temporal dynamic features (Yang et al., 2023b). Zhou realized spatial-temporal joint modeling through the coupling of multi-view graph attention and temporal convolution (Zhou and Wang, 2024). However, the above methods generally define the graph structure based on the static correlations of sensors, while ignoring the dynamic dependencies between time series data points based on their relative temporal positions. This omission of temporal association patterns hinders the model's ability to fully capture the complex spatiotemporal interactions during equipment operation, consequently limiting diagnostic accuracy and robustness.
Although deep learning has significantly promoted the development of intelligent fault diagnosis techniques, methods relying on a single signal source often face a generalization bottleneck due to insufficient feature characterization capability under complex and severe working conditions. For this reason, multi-source heterogeneous data fusion technology has become an important research direction. Currently, fault diagnosis research based on information fusion is conducted at three levels, i.e., the data level, feature level, and decision level. Zhang employed a residual pyramid algorithm to separately fuse acoustic and vibration signals from multiple spatial locations, generating two fused acoustic-vibration signals, and subsequently constructed a multi-source fault feature set (Zhang et al., 2022a). Feng proposed a multisource state space-based method for tool RUL prediction that models multistage degradation using a Wiener process, achieving improved prediction accuracy by integrating historical and real-time monitoring data (Feng et al., 2025a). Xu proposed a dynamic feature selection matrix optimization information integration method to transform a multi-source information system into a single-source information system (Xu and Li, 2025). Feng proposed a diagnostic framework based on vibration, electrostatic, and infrared multi-source heterogeneous data fusion, and achieved online diagnosis of gearbox faults through data conversion and feature fusion (Feng et al., 2025b). While these fusion methods have shown promise in fault diagnosis, they also raise concerns regarding data transformation and feature redundancy, as well as decision ambiguity.
In summary, current research has made significant progress in graph structure construction and multi-source data fusion, especially in modeling spatial structure using GCN and capturing temporal dependence with the help of models such as LSTM. However, most of the existing methods deal with the temporal and spatial dimensions separately, ignoring the interactive characteristics of temporal location and structural relationships among sequences, which makes it difficult to comprehensively portray the dynamic evolution mechanism in equipment operation. To this end, this paper proposes a multi-source information anomaly detection method that fuses temporal and spatial features. This method accurately identifies abnormal states of the elevator door system by deeply extracting latent temporal and spatial dependencies from the data and effectively incorporating them within a multi-source information fusion framework.
3. Methodology
3.1 Problem formulation and overall architecture
Elevator door anomaly detection focuses on identifying abnormal operating states during door opening and closing processes based on multi-source time series data. In practical operation, door system faults usually do not appear as isolated signal fluctuations, but rather manifest as localized deviations in temporal evolution and abnormal interactions among different signal segments. Such patterns are typically associated with interactions among multiple degraded components during door operation.
Therefore, elevator door anomaly detection is treated as a spatio-temporal pattern recognition task, in which both the temporal evolution of individual signals and the dependency relationships among time-series segments are considered. Effective detection requires capturing long-term trends and short-term fluctuations in the temporal domain, as well as modeling the structural relationships that reflect similarity and interaction among segments under different operating conditions. Based on this consideration, a spatio-temporal modeling framework is developed to support accurate identification of abnormal door operation states.
To address this problem, this study proposes a multi-source spatio-temporal information fusion framework for elevator door anomaly detection, as illustrated in Figure 1. The overall framework integrates temporal feature extraction, spatial dependency modeling, and multi-source feature fusion to achieve accurate identification of abnormal door operation states.
Figure 1. Convolutional network fusion model for spatial-temporal maps. Panels: "Multi-source Signal Acquisition" (x-, y-, and z-axis vibration signals and a video signal); "TSGCN Architecture" ("Temporal Feature Extraction" with local convolution and pooling layers and global forward and backward recurrent layers; "Spatial Feature Extraction" with graph convolution, batch normalization, ReLU, and a global-average-pooling graph readout); "Feature Fusion" (fusion of temporal features, fusion of spatial features, concatenation, and a fully connected classifier); "Result" (scatter plot and confusion matrix for the classes Jamming Fault, Door Control Fault, Slowdown Fault, and Normal). Source(s): Figure created by authors
Specifically, the proposed framework consists of a temporal modeling module, a spatial modeling module, and a feature fusion and classification module. The input of the framework includes multi-source signals collected during elevator door operation, such as video-derived speed sequences and vibration signals. These signals are first preprocessed and segmented into time-series samples, which serve as the basic units for subsequent spatio-temporal modeling.
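The segmentation step can be sketched as a sliding window over a one-dimensional signal. This is a minimal illustration only: the helper name `segment_signal`, the window length, and the stride are assumptions introduced here, since the text does not state how the samples are cut.

```python
def segment_signal(signal, window, stride):
    """Split a 1-D sequence into fixed-length windows (hypothetical helper).

    Trailing samples that do not fill a complete window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, stride)]


# Toy vibration trace of 10 samples, window of 4 samples, 50% overlap.
trace = [0.0, 0.1, 0.3, 0.2, -0.1, -0.4, -0.2, 0.0, 0.2, 0.1]
segments = segment_signal(trace, window=4, stride=2)
```

Each resulting segment would then act as one time-series sample for the temporal branch and as one node for the spatial branch.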
In the temporal modeling module, CNN are employed to extract local temporal features, while BiLSTM are used to capture long term temporal dependencies. In addition, a multi-head attention pooling mechanism is introduced to emphasize time steps that are more sensitive to anomaly characteristics, thereby enhancing the discriminative ability of temporal features.
In parallel, the spatial modeling module aims to characterize the dependency relationships among time series segments. Each segment is treated as a node in the graph, and a K nearest neighbor based graph construction strategy is used to establish similarity connections among nodes. Graph convolutional networks are then applied to the constructed graph to extract spatial features that reflect localized and sample specific interactions associated with abnormal door behavior.
Finally, the temporal features and spatial features extracted from different sources are fused to form a unified spatio temporal representation. This fused representation is fed into a fully connected classifier to output the anomaly category of the elevator door system.
The structure of the fusion module is expressed as:
$$F_{ST} = \mathrm{Concat}(F_T, F_S), \qquad \hat{y} = \mathrm{softmax}(W F_{ST} + b)$$
where $F_{ST}$ represents the fused spatial-temporal feature set, $F_T$ represents the multi-source temporal feature set, $F_S$ represents the multi-source spatial feature set, $W$ represents the weight, $b$ represents the bias, and $\hat{y}$ represents the output probability distribution over the state categories.
Through this integrated spatio-temporal modeling framework, the proposed method enables effective anomaly detection under complex operating conditions.
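The fusion and classification step can be illustrated with a minimal NumPy sketch. It assumes that fusion is a plain concatenation of the temporal and spatial feature vectors followed by one fully connected layer and a softmax; the feature sizes, the random weights, and the names `fuse_and_classify`, `f_t`, and `f_s` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def fuse_and_classify(f_temporal, f_spatial, weight, bias):
    """Concatenate temporal and spatial features, then apply FC + softmax."""
    fused = np.concatenate([f_temporal, f_spatial], axis=-1)
    # Row-vector convention: each row of `fused` is one sample.
    return softmax(fused @ weight + bias)


# Hypothetical sizes: 2 samples, 8 temporal dims, 4 spatial dims, 4 classes.
rng = np.random.default_rng(0)
f_t = rng.normal(size=(2, 8))
f_s = rng.normal(size=(2, 4))
weight = rng.normal(scale=0.1, size=(12, 4))
bias = np.zeros(4)

probs = fuse_and_classify(f_t, f_s, weight, bias)
predicted_class = probs.argmax(axis=-1)
print(probs.shape, predicted_class.shape)
```

Each row of `probs` is a probability distribution over the four state categories, and the predicted category is the index of its largest entry.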
3.2 KNN graph construction
Selecting an appropriate graph construction strategy is crucial for modeling the dynamic interactions in elevator door operation. Existing approaches include fully connected graphs, correlation-based graphs, and physical topology-based graphs. Fully connected graphs introduce dense and redundant connections, which may lead to over-smoothing and reduced discriminative ability in graph convolutional networks (Li et al., 2018). Correlation-based graph construction methods rely on global statistical dependency and typically produce static graphs, which may be insufficient to characterize localized and transient interactions in time-series segments (Wu et al., 2020). Physical topology-based graphs require explicit prior knowledge of sensor layouts or mechanical structures; however, in elevator door systems, the interactions among temporal segments are primarily driven by motion consistency and feature similarity rather than fixed spatial adjacency.
To address these limitations, we employ a KNN-based approach to construct a dynamic, sample-specific graph. This method connects each time-series segment to its k most similar neighbors in the feature space, capturing the localized relationships that are indicative of anomalous patterns. The KNN graph adapts to each sample and preserves locality, making it particularly suitable for detecting the subtle, localized deviations characteristic of elevator door faults (Qi et al., 2021; Seo et al., 2024).
The detailed construction process is as follows.
A time series sample of length $T$ can be represented as:
$$X = \{x_1, x_2, \dots, x_T\}$$
Before constructing the node graph, global normalization of the sequences is required to eliminate numerical scale differences between different samples. The standardization formula is as follows:
$$\tilde{x}_t = \frac{x_t - \mu}{\sigma}$$
where $\tilde{x}_t$ represents the standardized sample value at time $t$, $x_t$ denotes the original time series sample value at time $t$, and $\mu$ and $\sigma$ stand for the global mean and standard deviation across all time points, respectively.
Subsequently, the sliding window method is used to extract node features. At time $i$, a sliding window of length $w$, consisting of the time point itself and the subsequent $w-1$ time points, serves as the node input. The features of the $i$th graph node are then constructed as follows:
$$z_i = [\tilde{x}_i, \tilde{x}_{i+1}, \dots, \tilde{x}_{i+w-1}]$$
This approach effectively captures the dynamic changes of local time slices and enhances the temporal awareness of nodes.
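The normalization and sliding-window steps can be sketched in a few lines of NumPy. This is an illustrative sketch only: the window length, the stride of one step, and the synthetic sine signal are assumptions, and `standardize` and `sliding_window_nodes` are hypothetical helper names.

```python
import numpy as np


def standardize(series) -> np.ndarray:
    """Z-score a 1-D series with its global mean and standard deviation."""
    series = np.asarray(series, dtype=float)
    std = series.std()
    if std == 0.0:
        # A constant series carries no scale information; return zeros.
        return np.zeros_like(series)
    return (series - series.mean()) / std


def sliding_window_nodes(series: np.ndarray, window: int, stride: int = 1) -> np.ndarray:
    """Stack windows [t, t + window) as rows; each row is one graph node."""
    if window > len(series):
        raise ValueError("window must not exceed the series length")
    starts = range(0, len(series) - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])


# Synthetic stand-in for one door-operation signal (assumption).
signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 50))
normalized = standardize(signal)
nodes = sliding_window_nodes(normalized, window=5)
print(nodes.shape)  # (46, 5): 46 nodes with 5 features each
```

With a stride of one, a series of length 50 and a window of 5 yields 46 overlapping nodes, so neighboring nodes share most of their values.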
The graph structure data is further constructed based on the node features and the topology of the graph. It can be represented as follows:
$$G = (Z, E, A)$$
where $Z \in \mathbb{R}^{N \times d}$ is the node feature matrix, with $N$ being the number of nodes and $d$ being the feature dimension of a single node, $E$ represents the edge set, and $A \in \mathbb{R}^{N \times N}$ represents the adjacency matrix of the graph.
To establish structural connections between nodes, the edge set is constructed using the KNN method with Euclidean distance (Qi et al., 2021). For each node $i$, the $k$ nearest non-self nodes in the feature space are selected to form a neighbor set $\mathcal{N}_k(i)$, which takes the following form:
$$\mathcal{N}_k(i) = \operatorname{TopK}_{j \neq i}\left(\lVert z_i - z_j \rVert_2\right)$$
where $z_i$ and $z_j$ are the normalized node features, $\lVert \cdot \rVert_2$ denotes the Euclidean distance, and $\operatorname{TopK}$ denotes the selection of the $k$ neighboring nodes with the smallest distance.
While constructing edges, a Gaussian kernel function is used to determine the edge weights, which strengthens the model's ability to represent neighbor similarity:
$$A_{ij} = \begin{cases} \exp\left(-\dfrac{\lVert z_i - z_j \rVert_2^2}{2\delta^2}\right), & j \in \mathcal{N}_k(i) \\ 0, & \text{otherwise} \end{cases}$$
where $\delta$ is the kernel function bandwidth coefficient, which controls the attenuation of adjacency strength. The construction of the time series graph structure using a sliding window and K-nearest neighbors is illustrated in Figure 2.
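The KNN edge selection and Gaussian edge weighting described above can be sketched as follows. The exact kernel form (squared distance divided by twice the squared bandwidth) and the function name `knn_gaussian_adjacency` are assumptions made for illustration; the resulting matrix is directed, and whether the original method symmetrizes it is not stated.

```python
import numpy as np


def knn_gaussian_adjacency(nodes, k: int, bandwidth: float):
    """Connect each node to its k nearest non-self nodes with Gaussian weights."""
    nodes = np.asarray(nodes, dtype=float)
    n = nodes.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of nodes")

    # Pairwise Euclidean distances between node feature vectors.
    diff = nodes[:, None, :] - nodes[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude self from the neighbor search

    neighbors = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest nodes
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.ravel()

    adjacency = np.zeros((n, n))
    adjacency[rows, cols] = np.exp(-dist[rows, cols] ** 2 / (2.0 * bandwidth ** 2))
    return adjacency, neighbors


# Two well-separated pairs of points: each point's nearest neighbor is its partner.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
adjacency, neighbors = knn_gaussian_adjacency(points, k=1, bandwidth=1.0)
print(neighbors.ravel())
print(np.round(adjacency, 4))
```

A smaller bandwidth makes the weights decay faster with distance, so only very similar segments keep strong connections.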
The conceptual workflow shows how raw data is transformed into a graph structure based on relationships between features or nodes. On the left side, a box labeled “Data” contains a simple line plot representing an input signal or time-series data. Below it, an arrow points downward to a small network diagram composed of circular nodes connected by lines, indicating an initial graph representation derived from the data. In the center, a large rounded box illustrates how relationships between nodes are computed. Inside this box, two elements labeled “A subscript i” and “A subscript j” represent two nodes or features. A function is defined below them as “A subscript i, j equals f (A subscript i, A subscript j)”. To the right, two possible outcomes are shown: when “A subscript i, j equals 0”, the nodes “A subscript i” and “A subscript j” are displayed without a connecting line, indicating no edge between them; when “A subscript i, j equals 1”, the nodes are connected by a line, indicating the presence of an edge. On the far right, an arrow points to a more complex network graph with multiple nodes and connections.
Construction of time series graph structure. Source(s): Figure created by authors
3.3 Spatio-temporal feature extraction based on TSGCN
Based on the KNN-constructed dynamic graph described in Section 3.2, the spatio-temporal feature extraction stage jointly models the temporal evolution patterns and the dependency relationships among time-series segments during elevator door operation. In this work, each time-series segment is treated as a graph node, while edges represent the similarity-based interactions captured by the KNN graph. Under this formulation, elevator door anomaly detection requires simultaneous characterization of the temporal dynamics within each node and the spatial dependencies across nodes.
To meet this requirement, a temporal-spatial graph convolutional network (TSGCN) is employed as the core feature extraction model. The TSGCN integrates temporal modeling modules with graph convolution operations, enabling extraction of both time-domain characteristics and the structural information embedded in the constructed graph. This design allows the model to capture localized abnormal patterns and their propagation across related segments, which are critical for accurate detection of elevator door anomalies.
3.3.1 Time dimension model
To extract short-term fluctuations and long-term structural features in time series data, a temporal modeling module consisting of a CNN, a BiLSTM, and multi-head attention pooling is proposed. The module first extracts local dynamic patterns by convolution, then captures global temporal dependencies with the BiLSTM, and finally aggregates key time steps through weighted multi-head attention, generating discriminative sequence-level representations.
The convolutional stage captures abrupt changes, local patterns, and instantaneous variations in signals within a short period by sliding the local receptive field of a convolutional neural network along the sequence. It is well suited to signals with obvious local structure, such as vibration signals and image velocity curves. Two layers of 1-D convolution and pooling are used to compress the sequence length and enhance the extraction of local features.
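To make the effect of two convolution and pooling layers on sequence length concrete, the following single-channel NumPy sketch stacks two such blocks. It is a simplified illustration: the kernels are fixed hypothetical values rather than learned filters, batch normalization is omitted, and the kernel size of 3 and pool size of 2 are assumptions.

```python
import numpy as np


def conv1d_valid(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """1-D cross-correlation without padding, as used in CNN layers."""
    return np.correlate(x, kernel, mode="valid")


def max_pool1d(x: np.ndarray, size: int) -> np.ndarray:
    """Non-overlapping max pooling; trailing samples that do not fill a window are dropped."""
    usable = (len(x) // size) * size
    return x[:usable].reshape(-1, size).max(axis=1)


def local_feature_block(x: np.ndarray, kernel: np.ndarray, pool: int = 2) -> np.ndarray:
    """Convolution -> ReLU -> max pooling."""
    activated = np.maximum(conv1d_valid(x, kernel), 0.0)
    return max_pool1d(activated, pool)


rng = np.random.default_rng(0)
signal = rng.normal(size=64)                 # stand-in for one signal segment
edge_kernel = np.array([-1.0, 0.0, 1.0])     # hypothetical filter: responds to local slope
smooth_kernel = np.array([0.25, 0.5, 0.25])  # hypothetical filter: local averaging

stage1 = local_feature_block(signal, edge_kernel)
stage2 = local_feature_block(stage1, smooth_kernel)
print(len(signal), len(stage1), len(stage2))  # 64 31 14
```

Each block shortens the sequence (64 to 31 to 14 here), so the subsequent recurrent layer processes a compressed sequence of local features rather than raw samples.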
Subsequently, the convolved sequence features are fed into a bidirectional LSTM network to integrate contextual information and establish long-term dependencies in the temporal dimension. Compared to a unidirectional LSTM, BiLSTM employs parallel computation with both forward and backward LSTM units, effectively merging preceding and succeeding information at each time step to capture the signal's global temporal structure more comprehensively.
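In BiLSTM, the forward layer computes sequentially from time step $1$ to $T$, storing the forward hidden state $\overrightarrow{h}_t$ at each time step. Conversely, the backward layer computes in reverse order from time step $T$ to $1$, recording the backward hidden state $\overleftarrow{h}_t$ (Schuster and Paliwal, 1997). Finally, the hidden states from both directions are concatenated to form the output representation for the current time step, which is then used for subsequent decision-making or feature fusion.
The computational procedure can be formalized as:
$$\overrightarrow{h}_t = f(w_1 x_t + w_2 \overrightarrow{h}_{t-1})$$
$$\overleftarrow{h}_t = f(w_3 x_t + w_5 \overleftarrow{h}_{t+1})$$
$$y_t = g(w_4 \overrightarrow{h}_t + w_6 \overleftarrow{h}_t)$$
where $x_t$ represents the input at time step $t$, $f$ is the activation function, $g$ denotes the output transformation, and $w_1, \dots, w_6$ are the weight parameters.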
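The bidirectional recurrence can be illustrated with a deliberately simplified NumPy sketch. It uses plain tanh recurrent cells instead of gated LSTM cells, so it shows only the forward pass, the backward pass, and the concatenation of their hidden states; all weights are random and the function name is hypothetical.

```python
import numpy as np


def bidirectional_rnn(x, w_in_f, w_rec_f, w_in_b, w_rec_b) -> np.ndarray:
    """Run a tanh RNN forward and backward over x (T, D); return (T, 2H)."""
    steps = x.shape[0]
    hidden = w_rec_f.shape[0]
    h_forward = np.zeros((steps, hidden))
    h_backward = np.zeros((steps, hidden))

    state = np.zeros(hidden)
    for t in range(steps):  # forward layer: t = 1 ... T
        state = np.tanh(x[t] @ w_in_f + state @ w_rec_f)
        h_forward[t] = state

    state = np.zeros(hidden)
    for t in reversed(range(steps)):  # backward layer: t = T ... 1
        state = np.tanh(x[t] @ w_in_b + state @ w_rec_b)
        h_backward[t] = state

    # Each time step sees both past (forward) and future (backward) context.
    return np.concatenate([h_forward, h_backward], axis=1)


rng = np.random.default_rng(0)
T, D, H = 6, 3, 4
x = rng.normal(size=(T, D))
params = [rng.normal(scale=0.5, size=s) for s in [(D, H), (H, H), (D, H), (H, H)]]
out = bidirectional_rnn(x, *params)

# Changing only the last input affects the backward half at t = 0, not the forward half.
x_changed = x.copy()
x_changed[-1] += 1.0
out_changed = bidirectional_rnn(x_changed, *params)
print(out.shape)
```

The final comparison shows why the bidirectional structure matters: the representation of the first time step already carries information from the end of the sequence through the backward layer.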
To obtain a global representation of the entire sequence, a multi-head attention pooling mechanism is introduced to replace traditional average pooling or maximum pooling. This mechanism learns different attention weights, enabling the selection of the most discriminative time step information for the classification task. The computational process is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ represents the feature dimension of the key/query in each attention head.
Multiple attention heads compute attention responses in parallel across different subspaces. These responses are then concatenated and projected to obtain a unified representation.
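Attention pooling over time steps can be sketched as below. The sketch assumes one learned query vector per head that scores every time step, which is one common way to realize attention pooling and is not confirmed as the authors' exact design; all projection matrices are random placeholders and the function name is hypothetical.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


def attention_pool(seq, w_key, w_value, queries, w_out):
    """Pool a (T, D) sequence into one vector with several attention heads.

    w_key, w_value: (heads, D, d_k); queries: (heads, d_k); w_out: (heads * d_k, D_out).
    """
    heads, _, d_k = w_key.shape
    pooled, all_weights = [], []
    for h in range(heads):
        keys = seq @ w_key[h]                          # (T, d_k)
        values = seq @ w_value[h]                      # (T, d_k)
        scores = keys @ queries[h] / np.sqrt(d_k)      # scaled dot-product, (T,)
        weights = softmax(scores)                      # attention over time steps
        pooled.append(weights @ values)                # weighted sum, (d_k,)
        all_weights.append(weights)
    # Concatenate the heads and project to the output dimension.
    return np.concatenate(pooled) @ w_out, np.stack(all_weights)


rng = np.random.default_rng(0)
T, D, heads, d_k, d_out = 10, 6, 2, 3, 4
seq = rng.normal(size=(T, D))
vector, attention = attention_pool(
    seq,
    rng.normal(size=(heads, D, d_k)),
    rng.normal(size=(heads, D, d_k)),
    rng.normal(size=(heads, d_k)),
    rng.normal(size=(heads * d_k, d_out)),
)
print(vector.shape, attention.shape)
```

Unlike average pooling, which weights all time steps equally, each head here produces its own distribution over time steps, so different heads can focus on different parts of the door cycle.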
The overall time dimension model is shown in Figure 3.
The detailed pipeline for feature extraction from time-series data is divided into three labeled sections: “Local Feature Extraction”, “Global Feature Extraction”, and “Key Features”. On the left side of the section “Local Feature Extraction”, an input signal is shown as a vertical waveform plot. Segmented portions of the signal are highlighted and fed into two parallel processing streams with an ellipsis between them. Each stream begins with a block labeled “CONV”, representing convolutional layers applied to extract local patterns. The output passes through blocks labeled “B N” (batch normalization) and “R e L U” activation. Circular nodes labeled “Max” indicate “Max pooling” operations that reduce dimensionality while preserving important features. This sequence—“CONV”, “B N”, “R e L U”, and “Max pooling”—is repeated twice in each stream, producing stacked feature maps. The outputs from multiple streams are then combined into vertical feature vectors, each shown by a rectangle containing stacked circular nodes, representing extracted local temporal features. In the center, a section labeled “Global Feature Extraction” models temporal dependencies using a bidirectional structure. Input nodes labeled “x subscript 1” and “x subscript t” represent features at different time steps. These connect to hidden nodes labeled “vector h subscript 1” and “vector h subscript t” through weighted connections labeled “w subscript 1”, “w subscript 2”, “w subscript 3”, “w subscript 4”, “w subscript 5”, and “w subscript 6”. Two pathways are shown: a “Forward layer” and a “Backward layer”, indicating bidirectional processing of temporal information. Arrows illustrate the flow of information across time steps in both directions. Output nodes labeled “y subscript 1” and “y subscript t”, with an ellipsis between them, represent globally extracted temporal features. On the right side, a section labeled “Key Features” applies a block labeled “Multi-head Attention”. 
This module takes the globally extracted features and computes attention weights to emphasize the most important temporal information. The output is a set of three stacked circular nodes, with an ellipsis, labeled “Temporal Features”, representing refined feature vectors after attention-based selection.
Time dimension model. Source(s): Figure created by authors
3.3.2 Spatial dimension model
The spatial dimension model characterizes the structural relationships between different nodes within a signal. Given the low dimensionality and limited expressive power of the original node features, a linear mapping is employed to increase their dimensionality, as expressed below:
$$H^{(0)} = Z W_{up} + b_{up}$$
where $Z$ is the original node feature matrix, $W_{up}$ is the dimension-raising weight matrix, and $b_{up}$ is the bias vector.
After that, the upscaled features are fed into the GCN to model the structural relationships between nodes. The adjacency matrix of the graph is $A \in \mathbb{R}^{N \times N}$, where $A_{ij} \neq 0$ indicates that node $i$ and node $j$ are connected. A symmetric normalization method is used to construct the graph propagation matrix, expressed as follows (Seo et al., 2024):
$$\hat{A} = \tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}}$$
where $I$ is the identity matrix and $\tilde{D}$ is the corresponding degree matrix of $A + I$.
The propagation and updating process of the $l$th layer graph convolution is expressed as:
$$H^{(l+1)} = \phi\left(\hat{A} H^{(l)} W^{(l)}\right)$$
where $H^{(l)}$ denotes the node features of the $l$th layer, $W^{(l)}$ is the learnable parameter of the layer, and $\phi$ is the nonlinear activation function.
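The symmetric normalization and one graph convolution step can be sketched in NumPy as follows. The sketch assumes a symmetric adjacency matrix, ReLU as the nonlinear activation, and global average pooling over nodes as the readout; the three-node chain graph and all weights are illustrative placeholders.

```python
import numpy as np


def normalized_adjacency(adjacency: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    with_self_loops = adjacency + np.eye(adjacency.shape[0])
    degree = with_self_loops.sum(axis=1)  # degree of A + I, always >= 1 here
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return with_self_loops * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(features: np.ndarray, a_hat: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One propagation step: aggregate neighbors, transform, apply ReLU."""
    return np.maximum(a_hat @ features @ weight, 0.0)


# Three-node chain graph 0 - 1 - 2 (illustrative only).
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
a_hat = normalized_adjacency(adjacency)

rng = np.random.default_rng(0)
node_features = rng.normal(size=(3, 2))                   # low-dimensional raw node features
lifted = node_features @ rng.normal(size=(2, 8)) + np.zeros(8)  # linear dimension raising
hidden = gcn_layer(lifted, a_hat, rng.normal(size=(8, 8)))
spatial_features = hidden.mean(axis=0)                    # global average pooling over nodes
print(np.round(a_hat, 4))
print(spatial_features.shape)
```

Adding the identity matrix before normalization keeps each node's own features in the aggregation, and the degree scaling prevents highly connected nodes from dominating the propagated signal.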
The overall spatial dimension model is shown in Figure 4.
The diagram presents a pipeline for extracting spatial features from time-series data using graph convolutional networks. The diagram is divided into several labeled sections: “Data”, “G C N 1”, “G C N 2”, “G C N 3”, and the final output labeled “Spatial Features”, each connected by a right-pointing arrow. On the left side, a dashed box labeled “Data” contains a waveform plot representing a time-series signal. Below the waveform, a legend shows colored circular nodes labeled “Sample point 1”, “Sample point 2”, “Sample point 3”, followed by an ellipsis, and “Sample point n”. These colored nodes represent individual data samples that will be treated as nodes in a graph. An arrow points from the data section toward the first graph convolution block. The first processing block is labeled “G C N 1”. At the top of the block, a diagram labeled “Graph Convolution” shows a small network of connected nodes representing the graph structure. Below it, two sequential layers are labeled “Batch Norm” and “R e L U”, indicating batch normalization followed by a rectified linear unit activation. The second block labeled “G C N 2” repeats the same structure. A graph convolution layer processes node relationships, followed by a “Batch Norm” layer and a “R e L U” activation layer. The third block labeled “G C N 3” again contains a “Graph Convolution” diagram followed by “BatchNorm” and “R e L U”. Arrows between the blocks show the flow of information from one layer to the next. After the third graph convolution block, the output is passed to a node labeled “G A P”, which stands for “Global Average Pooling”. This operation aggregates node-level information into a single feature representation. The final output appears as a vertical column of circular nodes labeled “Spatial Features”, representing the extracted spatial feature vector derived from the graph-based processing of the data.
Spatial dimension model. Source(s): Figure created by authors
The pseudocode of TSGCN for anomaly detection is presented in Algorithm 1. During each epoch, the training process is executed, and subsequently, validation is performed. The system saves the model that achieves the best performance on the validation dataset. Upon reaching the maximum epoch, the optimal model is deployed to identify anomaly types within the test set.
Algorithm 1. Spatial-temporal information fusion for anomaly detection
Input: Multi-modal time series data (video sequence and vibration sequence) and the corresponding graph structure data for each modality
Output: Anomaly category prediction results
1. Preprocessing stage:
2.   Time series are extracted from video and sensors
3.   KNN graph structures are constructed for the video and vibration signals based on signal features
4.   The dataset is divided into training, validation, and test sets
5. Training and validation stage:
6.   Network parameters are randomly initialized
7.   Inputs: hidden layer dimension, learning rate, batch size, number of training epochs
8.   for each epoch in training epochs do
9.     A batch of training data is sampled
10.    Temporal features are extracted using BiLSTM:
11.      CNN is applied to extract local temporal patterns
12.      Time series are encoded using BiLSTM
13.      Multi-head attention pooling is applied
14.      BiLSTM features of video and vibration are concatenated
15.    Spatial features are extracted using GCN:
16.      GCN modules are applied to the video graph and the vibration graph respectively
17.      Graph features are extracted through global average pooling
18.      Graph convolutional features of video and vibration are concatenated
19.    Temporal and spatial features are fused:
20.      BiLSTM and GCN features are concatenated
21.      The result is input to a fully connected layer classifier
22.    The cross-entropy loss is calculated
23.    The loss is backpropagated and the parameters are updated
24.    Model performance is evaluated on the validation set
25.  end for
26. Testing stage:
27.   Input: test set
28.   Output: Anomaly type prediction results
Source(s): Algorithm created by authors
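The epoch loop with validation-based model selection described for Algorithm 1 can be sketched independently of any deep learning framework. In this sketch `train_epoch` and `validate` are hypothetical callables standing in for the actual training and validation passes, and the toy "model" is a single number; only the keep-the-best-checkpoint logic is the point.

```python
import copy


def fit_with_best_checkpoint(model_state, train_epoch, validate, max_epochs):
    """Train for max_epochs and keep the state with the best validation score."""
    best_score = float("-inf")
    best_state = copy.deepcopy(model_state)
    history = []
    for _ in range(max_epochs):
        model_state = train_epoch(model_state)  # one pass over the training set
        score = validate(model_state)           # higher is better (e.g. accuracy)
        history.append(score)
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model_state)  # snapshot the best model so far
    return best_state, best_score, history


# Toy stand-ins (assumptions): the "model" is one weight, each epoch adds 1,
# and the validation score peaks when the weight equals 3.
def toy_train_epoch(state):
    return {"w": state["w"] + 1.0}


def toy_validate(state):
    return -(state["w"] - 3.0) ** 2


best_state, best_score, history = fit_with_best_checkpoint(
    {"w": 0.0}, toy_train_epoch, toy_validate, max_epochs=5
)
print(best_state, history)
```

In the toy run the score improves for three epochs and then degrades, so the returned state is the one from the third epoch rather than the last one, which mirrors selecting the best validation model before testing.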
4. Experimental verification
To validate the effectiveness of the TSGCN-based anomaly detection method, experiments were conducted on the elevator door system dataset. To further evaluate its performance, several classical anomaly detection methods were selected as comparison baselines. In addition, the influence of key TSGCN parameters on model performance was analyzed.
4.1 Experimental data setup
4.1.1 Data acquisition
The experimental data were collected from elevator doors in normal operation within an apartment building. Faults were artificially introduced to obtain data representing different states, including Normal, Slowdown, Jamming, and Abnormal Door Closing. The elevator doors have a center-opening structure. Vibration data were collected by installing an attitude measurement sensor in the gap between the landing door and the car door on one side of the car. This sensor integrates a three-axis accelerometer, a three-axis gyroscope, a three-axis angle sensor, and a three-axis magnetometer. The sensor supports data storage and export, with an output frequency ranging from 0.2 Hz to 200 Hz, and its sampling frequency is independently configurable. Video data, capturing the opening and closing process, were collected by deploying a camera in the corner of the car. The video frame rate was 30 fps, while the vibration sensor's sampling frequency was set to 50 Hz. Figure 5 shows the installation location of the acquisition equipment.
The left panel labeled “(a)” shows the video data acquisition setup inside an elevator with metallic interior walls. A smartphone mounted on a small holder is attached to the wall and is labeled “Video Collector”, with its screen displaying a recording interface. Above the smartphone near the ceiling, a dome-shaped surveillance camera is visible. The surrounding surfaces appear metallic and reflective, forming the interior structure of the elevator cabin. The right panel labeled “(b)” shows the vibration data acquisition setup near the elevator doorway. The panel displays two metallic sliding door sections identified by the labels “Landing Door” on the left and “Car Door” on the right. Between the door panels, a small rectangular device labeled “Vibration sensor” is mounted vertically along the door frame. Thin annotation lines connect each label to the corresponding component, indicating the sensor placement and the positions of the two doors.
Device acquisition diagram. Source(s): Figure created by authors
We simulated various states on the actual elevator door to obtain normal and abnormal data. Specifically, the slowdown state was simulated by inserting objects between the door and the column to increase running friction. Abnormal door closing was simulated by placing obstacles between the two doors so that they could not close. Jamming data were collected directly from an elevator door with an actual fault.
4.1.2 Data processing
Each collected sample contains a complete door opening and closing operation. The vibration data consist of the acceleration signals along the X, Y, and Z axes. The X-axis represents the lateral vibration of the elevator door, the Y-axis represents the vertical vibration, and the Z-axis represents the fore-and-aft vibration. The three-dimensional vibration signals are shown in Figure 6.
The twelve panels are arranged in three rows and four columns, grouped into four conditions labeled “(a)”, “(b)”, “(c)”, and “(d)” at the bottom, each containing three line graphs for “Axis X”, “Axis Y”, and “Axis Z”. In all panels, the horizontal axis is labeled “Time step” and ranges approximately from 0 to 600 in (a), (b), and (d) in increments of 100 units and from 0 to 1000 in (c) in increments of 200 units. The vertical axis is labeled “Acceleration (meters per second squared)”. In condition “(a)”, the legend on each plot is labeled “Normal-Axis X”, “Normal-Axis Y”, and “Normal-Axis Z”. The vertical axis ranges from negative 7.5 to 5.0 in increments of 2.5 in the top plot, from negative 4 to 2 in increments of 2 in the middle plot, and from negative 5.0 to 7.5 in increments of 2.5 in the bottom plot. The signals across all three axes show moderate fluctuations around zero with occasional spikes, including a noticeable peak around time step 200 and another cluster of activity near 500 to 600, with brief sharp dips and rises indicating transient motion. In condition “(b)”, the legend on each plot is labeled “Slowdown-Axis X”, “Slowdown-Axis Y”, and “Slowdown-Axis Z”. The vertical axis ranges from negative 10.0 to 5.0 in increments of 2.5 in the top plot, from negative 2 to 6 in increments of 2 in the middle plot, and from negative 10 to 5 in increments of 5 in the bottom plot. The signals display stronger variability compared to normal, with more frequent spikes and wider amplitude changes, including pronounced peaks around time steps 100 to 200 and again near 500 to 600, along with intermittent quieter intervals. In condition “(c)”, the legend on each plot is labeled “Jamming-Axis X”, “Jamming-Axis Y”, and “Jamming-Axis Z”. The vertical axis ranges from negative 2 to 4 in increments of 2 in the top plot, from negative 10 to 5 in increments of 5 in the middle plot, and from negative 4 to 6 in increments of 2 in the bottom plot. 
The signals show irregular and abrupt bursts with larger amplitude deviations, including sharp spikes and sudden drops, particularly strong negative excursions in the middle plot and clustered oscillations around time steps near 600 to 800, indicating unstable behavior. In condition “(d)”, the legend on each plot is labeled “Abnormal Door Closing-Axis X”, “Abnormal Door Closing-Axis Y”, and “Abnormal Door Closing-Axis Z”. The vertical axis ranges from negative 5 to 7.5 in increments of 2.5 in the top plot, from negative 2 to 6 in increments of 2 in the middle plot, and from negative 5 to 7.5 in increments of 2.5 in the bottom plot. The signals exhibit strong early fluctuations with distinct peaks around time steps 100 to 200, followed by relatively stable segments and later renewed activity near 500, with noticeable spikes and uneven oscillations across all three axes. Note: All numerical data values are approximated.
Three-dimensional vibration signals under different states. (a) Normal. (b) Slowdown. (c) Jamming. (d) Abnormal door closing. Source(s): Figure created by authors
The video data are processed as follows. First, each video sample is split into frames, and the edge position of the door panel in each frame is extracted using an edge detection algorithm. Then, the pixel displacement between consecutive frames is calculated and converted to physical speed using a calibrated scale factor. Figure 7 illustrates this process: raw video frames are transformed into a door displacement curve and subsequently into a speed curve.
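The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not name its edge detector, so `edge_position` uses a simple largest-intensity-jump rule on one image row, and the frame rate (`fps`) and scale factor (`metres_per_pixel`) are hypothetical values, not the calibrated ones used in the experiments.

```python
from typing import List, Sequence


def edge_position(row: Sequence[float]) -> int:
    """Return the pixel index with the largest intensity jump along one image row.

    Stand-in for the (unspecified) edge detection algorithm.
    """
    if len(row) < 2:
        raise ValueError("row needs at least two pixels")
    jumps = [abs(row[i + 1] - row[i]) for i in range(len(row) - 1)]
    return jumps.index(max(jumps))


def door_speed(positions_px: Sequence[float], fps: float,
               metres_per_pixel: float) -> List[float]:
    """Convert per-frame edge positions (pixels) into speeds (m/s).

    speed = pixel displacement between consecutive frames
            * scale factor (m/pixel) * frame rate (frames/s)
    """
    return [
        (positions_px[i + 1] - positions_px[i]) * metres_per_pixel * fps
        for i in range(len(positions_px) - 1)
    ]


# Synthetic example: a dark door panel (0) next to a bright background (255),
# with the panel edge moving two pixels to the right in every frame.
frames = [[0] * edge + [255] * (20 - edge) for edge in (5, 7, 9, 11)]
positions = [edge_position(row) for row in frames]

# Hypothetical calibration: 25 frames per second, 2 mm per pixel.
speeds = door_speed(positions, fps=25.0, metres_per_pixel=0.002)
```

With real footage, the edge position would be tracked along the door's direction of travel in every frame, and the resulting speed curve would usually need smoothing before being fed to the network.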
The left panel labeled “(a)” shows an elevator entrance framed by metallic sliding doors that are partially open, exposing a light-colored interior wall and a closed gray door with a handle in the background. The door panels have visible vertical seams and reflective surfaces. Two thin vertical reference lines are overlaid near the left and right edges of the doorway, aligned with the door boundaries. On the left door panel, small circular safety icons are arranged vertically, and on the right side, a notice board and additional signage are visible on the wall next to the door frame. The elevator frame and surrounding panels appear smooth and metallic, with straight edges and a rectangular opening. The right panel labeled “(b)” contains a line graph with a legend labeled “Door position”, representing a line. The horizontal axis is labeled “time” and ranges from negative 2 to 16 in increments of 2 units. The vertical axis is labeled “Pixel value” and ranges from 300 to 600 in increments of 50 units. The plotted line begins near 320 around time 0, increases sharply around time 2 to reach 590 near time 4, remains nearly constant close to 590 until about time 10, then decreases rapidly after time 11 and returns to 320 by around time 13, remaining stable afterward. Note: All numerical data values are approximated.
Video processing. (a) Edge position of the door. (b) Door displacement curve. Source(s): Figure created by authors
Based on the extracted door displacement information, the opening and closing running curves of the elevator door are obtained, as shown in Figure 8.
The four panels in a two-by-two grid are labeled “(a)”, “(b)”, “(c)”, and “(d)”, each showing a line graph with a legend labeled “Normal”, “Slowdown”, “Jamming”, and “Abnormal Door Closing”, respectively. In all panels, the horizontal axis is labeled “Time step” and ranges from 0 to 400 in (a) and (b) in increments of 100 units, from 0 to 400 in (c) in increments of 200 units, and from 0 to 300 in (d) in increments of 100 units. The vertical axis is labeled “Velocity (meters per second)” and ranges from negative 2 to 2 in increments of 1 unit in (a), (b), and (d), and from negative 1 to 1 in (c). In panel “(a)”, the velocity increases from near 0 to above 2 in early time steps, then drops to 0 and remains stable before decreasing sharply to around negative 2 near time step 300 and finally returning toward 0. In panel “(b)”, the velocity follows a similar pattern with an initial rise above 2, a flat region at 0, then a drop to around negative 2 after time step 300, followed by a gradual return toward 0. In panel “(c)”, the velocity rises to around 1, stabilizes briefly, then drops to near 0, followed by a gradual decline to around negative 1 near time step 350, and then fluctuates slightly while remaining below 0. In panel “(d)”, the velocity increases to above 2 in the early phase, quickly drops to 0, remains flat for a period, then decreases sharply to near negative 2 around time step 250, and ends with slight fluctuations below 0. Note: All numerical data values are approximated.
Elevator door operating curves under different states. (a) Normal. (b) Slowdown. (c) Jamming. (d) Abnormal door closing. Source(s): Figure created by authors
Due to the varying lengths of the collected sample sequences, sequence alignment and padding mechanisms are introduced into the batch processing stage to ensure uniform input dimensions. This leverages LSTM's capability to handle variable-length sequences. Specifically, the samples within each batch are first sorted in descending order according to their sequence lengths. Then, a uniform padding strategy extends all sequences to the length of the longest sequence in the current batch. The padding value is set to zero to avoid interfering with valid features. To preserve the original temporal structure, the actual length information is transmitted to the network as input during sequence packing, eliminating the impact of padding values on model training.
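A minimal sketch of the sorting and zero-padding step is given below in plain Python. The function name `pad_batch` and the nested-list data layout are illustrative assumptions, not the authors' code.

```python
from typing import List, Tuple

Sample = List[List[float]]  # one sample: time steps x features


def pad_batch(batch: List[Sample],
              pad_value: float = 0.0) -> Tuple[List[Sample], List[int]]:
    """Sort samples by length (descending) and pad them to the longest one.

    Returns the padded samples together with their true lengths, so that
    downstream code can ignore the padded time steps.
    """
    if not batch:
        raise ValueError("batch must contain at least one sample")
    ordered = sorted(batch, key=len, reverse=True)
    lengths = [len(sample) for sample in ordered]
    max_len = lengths[0]
    n_features = len(ordered[0][0])
    padded = [
        sample + [[pad_value] * n_features for _ in range(max_len - len(sample))]
        for sample in ordered
    ]
    return padded, lengths


# Three samples with 1, 3 and 2 time steps and two features each.
batch = [
    [[1.0, 2.0]],
    [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
    [[9.0, 1.0], [2.0, 3.0]],
]
padded, lengths = pad_batch(batch)
```

If the model is implemented in PyTorch (the layer names in the paper suggest so, although this is not stated), the padded tensor and `lengths` would typically be passed to `torch.nn.utils.rnn.pack_padded_sequence` so that the recurrent layers skip the padded positions.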
4.2 Experimental results
4.2.1 Evaluation indicators
The vibration and video datasets comprise 802 samples. In accordance with common practice, 60% of the data were randomly selected to form the training set, with 20% allocated to the validation set and 20% to the test set. The specific numbers of training, validation, and test samples are given in Table 1.
Description of elevator door conditions
| Condition | Label | Number of training/validation/testing samples |
|---|---|---|
| Normal | 0 | 120/40/40 |
| Slowdown | 1 | 120/40/40 |
| Jamming | 2 | 120/41/40 |
| Abnormal door closing | 3 | 120/40/41 |
Accuracy and F1 score are used as core metrics to evaluate model performance. Accuracy is the proportion of samples correctly predicted by the model out of all samples, reflecting overall classification correctness. The F1 score, which integrates precision and recall, is suitable for evaluating performance on imbalanced data or in scenarios where misclassification occurs. The F1 score ranges from 0 to 1, with higher values indicating a better balance between precision and recall. The evaluation metrics are formulated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $TP$, $FP$, $TN$, and $FN$ represent the numbers of true positive, false positive, true negative, and false negative samples, respectively.
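As a worked illustration, the standard definitions of accuracy, precision, recall, and F1 can be evaluated directly from a confusion matrix. The sketch below uses the counts read from the confusion matrix reported in Figure 9 (rows are true labels, columns are predicted labels). The per-class, one-vs-rest treatment is an assumption, since the averaging scheme over the four classes is not stated here.

```python
from typing import List

Matrix = List[List[int]]  # cm[true_label][predicted_label]


def accuracy(cm: Matrix) -> float:
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def precision(cm: Matrix, k: int) -> float:
    """TP / (TP + FP) for class k, treated one-vs-rest."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm)) if i != k)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(cm: Matrix, k: int) -> float:
    """TP / (TP + FN) for class k, treated one-vs-rest."""
    tp = cm[k][k]
    fn = sum(cm[k][j] for j in range(len(cm)) if j != k)
    return tp / (tp + fn) if tp + fn else 0.0


def f1(cm: Matrix, k: int) -> float:
    """Harmonic mean of precision and recall for class k."""
    p, r = precision(cm, k), recall(cm, k)
    return 2 * p * r / (p + r) if p + r else 0.0


# Counts from Figure 9: 0 = normal, 1 = slowdown, 2 = jamming,
# 3 = abnormal door closing.
cm = [
    [38, 1, 0, 1],
    [2, 38, 0, 0],
    [0, 0, 40, 0],
    [0, 2, 0, 39],
]
overall = accuracy(cm)  # 155 correct out of 161 samples
jamming_f1 = f1(cm, 2)
```

Here `overall` equals 155/161, which rounds to the reported 96.27%, and the jamming class reaches a precision, recall, and F1 of 1.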
4.2.2 Network structure parameterization
To fully leverage multi-source data features for elevator door anomaly detection, we designed a spatiotemporal network architecture incorporating BiLSTM and GCN. The structure and training parameters of each sub-module are detailed in Table 2 and Table 3.
Structure of the network module
| Module name | Function | Network architecture |
|---|---|---|
| BiLSTM | Local feature extraction | Conv1d(k = 3, s = 1, p = 1), BatchNorm1d(64), ReLU() |
| | | MaxPool1d() |
| | | Conv1d(k = 3, s = 1, p = 1), BatchNorm1d(128), ReLU() |
| | | MaxPool1d() |
| | Contextual relationship | BiLSTM(hidden = 128(Video)/256(Vibration)) |
| | Focus on important features | Multi-head Attention(num_heads = 4) |
| GCN | Spatial feature | GCNConv(hidden = 128(Video)/256(Vibration)), ReLU() |
| | | GCNConv(), ReLU() |
| | | GCNConv(), ReLU(), Global Average Pooling() |
| Characteristic fusion | Multisource spatial-temporal characterization | Concat() |
| | | Linear() |
| | | Dropout(0.3) |
| | | Linear() |
Training configuration parameters in the network
| Parameter | Set value |
|---|---|
| Optimizer | AdamW |
| Initial learning rate | 1e−4 |
| Weight decay | 5e−2 |
| Scheduling strategy | ReduceLROnPlateau |
| Loss function | CrossEntropy |
| Epoch | 200 |
| Batch size | 16 |
4.2.3 Experimental results
To validate the proposed method, we conducted experiments on a dataset collected from real elevator operations, containing both video and vibration signals. The classification results are shown in the confusion matrix in Figure 9. On the test set, 155 of the 161 samples were classified correctly, giving a detection accuracy of 96.27%. Notably, the recognition rate for jamming faults reached 100%. However, misclassifications occurred among the normal state (label 0), slowdown (label 1), and abnormal door closing (label 3). Further analysis revealed that, although these three states are distinct, they were all artificially simulated on the same elevator. This may have led to pattern overlap in the sensor signals, hindering the model's ability to discriminate between their features. Future research will focus on introducing more types of sensor data to improve the discrimination of similar faults.
The heatmap is titled “Confusion Matrix”, showing a four-by-four grid of values. The horizontal axis is labeled “Predicted” and includes class labels 0, 1, 2, and 3. The vertical axis is labeled “True” and includes class labels 0, 1, 2, and 3. Each cell contains a numeric value representing the count of predictions for each true class. In the first row for true class 0, the values are 38 under predicted 0, 1 under predicted 1, 0 under predicted 2, and 1 under predicted 3. In the second row for true class 1, the values are 2 under predicted 0, 38 under predicted 1, 0 under predicted 2, and 0 under predicted 3. In the third row for true class 2, the values are 0 under predicted 0, 0 under predicted 1, 40 under predicted 2, and 0 under predicted 3. In the fourth row for true class 3, the values are 0 under predicted 0, 2 under predicted 1, 0 under predicted 2, and 39 under predicted 3. The diagonal cells contain the highest values, indicating correct classifications, while the off-diagonal cells contain small values representing misclassifications.
Confusion matrix result. Source(s): Figure created by authors
The features are further visualized using the UMAP method, and the results are shown in Figure 10. As the figure shows, the feature distributions for the “Normal” and “Slowdown” labels are highly similar and difficult to distinguish accurately, consistent with the confusion matrix results.
The plot displays a scatter distribution of data points grouped into four classes with a legend titled “Class” identifying “Jamming”, “Abnormal Door Closing”, “Slowdown”, and “Normal”. The horizontal axis ranges from negative 10 to 15 in increments of 5 units, and the vertical axis ranges from negative 5 to 15 in increments of 5 units. The points form distinct clusters in different regions of the plot. The “Abnormal Door Closing” cluster is located on the left side around horizontal values near negative 12 and vertical values around 4 to 5, forming a compact group. The “Jamming” cluster appears on the lower right side around horizontal values near 12 to 14 and vertical values around negative 5 to negative 3. The “Slowdown” cluster is positioned in the upper middle-right region around horizontal values near 4 to 5 and vertical values around 8 to 10. The “Normal” cluster is located slightly above and to the right of the slowdown cluster, around horizontal values near 5 to 7 and vertical values around 11 to 14. The clusters are well separated with minimal overlap, indicating a clear distinction among the four classes. Note: All the numerical data values are approximated.
UMAP visualization result. Source(s): Figure created by authors
4.3 Further discussion
4.3.1 Hyperparameter sensitivity analysis
In order to explore the influence of hyperparameters on model performance, we conducted a hyperparameter sensitivity analysis to determine appropriate values. During KNN graph construction, the number of node neighbors (K value) and the distance metric significantly influence the classification results. A K value that is too small results in a sparse graph structure, which, while beneficial for capturing local key point dependencies, may overlook long-range related node information, leading to insufficient information propagation. Conversely, a K value that is too large creates a dense graph structure and increases the information propagation path. This can alleviate the locality limitation of the adjacency structure but may introduce redundant or noisy connections, resulting in over-smoothing or overfitting.
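The role of K can be made concrete with a small sketch of KNN graph construction. It assumes that each node is represented by a feature vector and that edges are directed from a node to its K nearest neighbors under Euclidean distance. The function name `knn_edges` and the toy data are hypothetical, and the paper's exact node definition and edge weighting are not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def knn_edges(nodes: Sequence[Sequence[float]], k: int) -> List[Tuple[int, int]]:
    """Return directed edges (i, j) where j is one of the k nearest nodes to i.

    Distances are Euclidean; ties are broken by the smaller node index.
    """
    if not 0 < k < len(nodes):
        raise ValueError("k must satisfy 0 < k < number of nodes")
    edges: List[Tuple[int, int]] = []
    for i, a in enumerate(nodes):
        # Sort all other nodes by (distance, index) and keep the first k.
        neighbours = sorted(
            (math.dist(a, b), j) for j, b in enumerate(nodes) if j != i
        )
        edges.extend((i, j) for _, j in neighbours[:k])
    return edges


# Toy example: four one-dimensional nodes, one of them far from the rest.
nodes = [[0.0], [1.0], [2.0], [10.0]]
sparse = knn_edges(nodes, k=1)  # one outgoing edge per node
dense = knn_edges(nodes, k=3)   # every node linked to every other node
```

With `k=1` the distant node only attaches to its single closest neighbor, whereas with `k=3` the graph is fully connected and the distant node influences every other node, which mirrors the sparse-versus-dense trade-off described above.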
In this study, we consider several commonly used distance metrics to explore their impact on temporal-spatial graph construction, including Euclidean, Manhattan, Chebyshev, and Minkowski distances. Euclidean distance measures the straight-line distance between two points in a multidimensional space and aligns well with the natural continuity of time series data. Manhattan distance, defined as the sum of absolute axial differences, emphasizes edge features and is more sensitive to small perturbations. Chebyshev distance focuses on the maximum coordinate difference, making it suitable for detecting extreme change points. Minkowski distance introduces a tunable parameter $p$ and reduces to Manhattan distance when $p = 1$, Euclidean distance when $p = 2$, and Chebyshev distance as $p \to \infty$. Therefore, we specifically analyze the effects of Euclidean, Manhattan, and Chebyshev distances on the experimental results. Based on comparative evaluation, Euclidean distance is selected as the primary metric for subsequent graph construction due to its superior performance and consistency with temporal smoothness.
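The relationship between the four metrics can be checked numerically. The sketch below implements the Minkowski distance $d_p(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$ and treats an infinite $p$ as the Chebyshev limit. The function name and the example points are illustrative only.

```python
import math
from typing import Sequence


def minkowski(a: Sequence[float], b: Sequence[float], p: float) -> float:
    """Minkowski distance of order p (p >= 1); p = inf gives Chebyshev distance."""
    if p < 1:
        raise ValueError("p must be at least 1")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs)  # limit of the p-norm as p grows without bound
    return sum(d ** p for d in diffs) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(a, b, 1)         # |3| + |4|
euclidean = minkowski(a, b, 2)         # sqrt(3^2 + 4^2)
chebyshev = minkowski(a, b, math.inf)  # max(|3|, |4|)
```

For these two points the three special cases give 7, 5, and 4, respectively, so the same pair of nodes can rank quite differently among a node's neighbors depending on the metric.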
In addition, a single-factor experiment was conducted on the learning rate and dropout rate using the controlled variable method to analyze the model's sensitivity to key hyperparameters. Only the target hyperparameter is adjusted in each round of experiments, and the remaining parameters are kept unchanged. Specifically, the settings were as follows: learning rate $\in \{10^{-5}, 5 \times 10^{-5}, 10^{-4}, 5 \times 10^{-4}, 10^{-3}\}$ and dropout rate $\in \{0.1, 0.2, 0.3, 0.4, 0.5\}$, matching the horizontal axes of Figure 11(c) and (d).
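The controlled variable procedure amounts to generating configurations that differ from a baseline in exactly one hyperparameter. A minimal sketch follows. The baseline values are taken from Tables 2 and 3, the candidate values are those on the horizontal axes of Figure 11(c) and (d), and the names `BASELINE`, `GRID`, and `one_factor_configs` are hypothetical. Training and validation of each configuration are left out.

```python
from typing import Any, Dict, Iterator, List

# Baseline settings (learning rate from Table 3, dropout from Table 2).
BASELINE: Dict[str, Any] = {"learning_rate": 1e-4, "dropout": 0.3}

# Candidate values, as read from the axes of Figure 11(c) and (d).
GRID: Dict[str, List[Any]] = {
    "learning_rate": [1e-5, 5e-5, 1e-4, 5e-4, 1e-3],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
}


def one_factor_configs(baseline: Dict[str, Any],
                       grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield configurations that vary one hyperparameter at a time."""
    for name, values in grid.items():
        for value in values:
            config = dict(baseline)  # copy so the baseline stays untouched
            config[name] = value
            yield config


configs = list(one_factor_configs(BASELINE, GRID))
```

Each entry of `configs` would then be trained and scored on the validation set. The sweep needs only 5 + 5 runs, compared with 25 for a full grid over both parameters, at the cost of not observing interactions between them.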
The model's performance on the validation set was recorded for each parameter configuration, and the results are shown in Figure 11. The optimal hyperparameters selected in the experiments are $K = 5$, a learning rate of $10^{-4}$, a dropout rate of 0.3, and Euclidean distance as the distance metric. As shown in Figure 11(a), when $K$ is too small, the graph structure is too sparse because each node has too few neighbors, so the long-range dependencies between nodes are not fully captured. Consequently, information propagation is insufficient, leading to low accuracy. When $K$ is too large, the graph structure becomes dense, introducing a large number of redundant and noisy connections, which results in the over-averaging of node features and the destruction of local key information. Over the range of $K$ in which accuracy improves, the larger number of neighbors partially alleviates the locality limitation. For the distance metric in Figure 11(b), Manhattan distance is more sensitive to noise, resulting in a slight decrease in accuracy, while Chebyshev distance performs poorly because it ignores temporal continuity. Regarding the learning rate in Figure 11(c), values that are too small or too large hinder model performance. A smaller learning rate leads to slow convergence, while a larger one causes excessive parameter updates, making the model overshoot the optimal solution and resulting in unstable performance. As for the dropout rate in Figure 11(d), an excessively high rate causes the model to underfit because a significant amount of information is lost throughout training, while a rate that is too low may not be sufficient to prevent overfitting.
The four panels arranged in a two-by-two grid are labeled “(a)”, “(b)”, “(c)”, and “(d)”, each showing a line graph illustrating the sensitivity of different parameters on model accuracy. In all panels, the vertical axis is labeled “Accuracy” and ranges from approximately 0.88 to 0.98 in increments of 0.02 in panels “(a)” and “(b)”, from 0.80 to 1.00, with the intermediate markings at 0.83, 0.85, 0.88, 0.90, 0.93, 0.95, and 0.98 in panel (c), and from 0.88 to 1.00 in increments of 0.02 in panel “(d)”. In panel “(a)” titled “Sensitivity of K in K N N”, the horizontal axis is labeled “K in K N N” and ranges from 1 to 10 in increments of 1 unit. The plotted points fluctuate around 0.90 to 0.96, increasing from about 0.90 at K equals 1 to around 0.945 at K equals 2, decreasing slightly at K equals 3, rising again and reaching the highest value near 0.96 at K equals 5, then dropping to about 0.90 at K equals 7 before gradually increasing toward approximately 0.94 at K equals 10. In panel “(b)” titled “Sensitivity of Distance Metric in K N N”, the horizontal axis is labeled “Distance Metric” and includes three categorical values: “Euclidean”, “Manhattan”, and “Chebyshev”. The plotted values show the highest accuracy near 0.964 for Euclidean, slightly lower near 0.943 for Manhattan, and the lowest around 0.91 for Chebyshev, indicating a decreasing trend. In panel “(c)” titled “Sensitivity of Learning Rate”, the horizontal axis is labeled “Learning Rate” and includes values 10 to the negative 5 power, 5 times 10 to the negative 5 power, 10 to the negative 4 power, 5 times 10 to the negative 4 power, and 10 to the negative 3 power. The plotted accuracy rises from approximately 0.85 at 10 to the negative 5 power to a peak around 0.96 at 10 to the negative 4 power, then decreases to about 0.91 at 5 times 10 to the negative 4 power and further to roughly 0.85 at 10 to the negative 3 power. 
In panel “(d)” titled “Sensitivity of Dropout”, the horizontal axis is labeled “Dropout Rate” and ranges from 0.1 to 0.5 in increments of 0.1. The plotted values increase from approximately 0.913 at 0.1 to about 0.96 at 0.3, then decrease to around 0.93 at 0.4 before slightly increasing again to near 0.94 at 0.5. Note: All numerical data values are approximated.
Accuracy comparison of different hyperparameters. (a) Sensitivity of K in KNN. (b) Effect of distance metric on KNN. (c) Sensitivity of learning rate. (d) Sensitivity of dropout. Source(s): Figure created by authors
4.3.2 Comparison with common single-sensor anomaly detection methods
Currently, most anomaly detection methods rely on single-type sensor signals, especially vibration signals, to determine the operating status of a system. To verify the effectiveness of the proposed multi-source fusion anomaly detection method, we selected single-sensor modeling methods such as LiConvFormer (Yan et al., 2024), TCN (Zhang et al., 2022b), CNN (Liu et al., 2019), BiLSTM (Abebe et al., 2024), and GCN (Chen et al., 2022) as comparison baselines. Each model was trained and evaluated under the same data conditions, and the performance comparison results are shown in Figure 12.
The two side-by-side grouped bar charts are labeled “(a)” and “(b)”. In both charts, the horizontal axis lists the metrics “Accuracy”, “Precision”, “Recall”, and “F 1”, and the vertical axis is labeled “Score” ranging from 0.65 to 1.00 in increments of 0.05. Each metric group contains six bars corresponding to the models “Li Conv Former”, “T C N”, “C N N”, “Bi L S T M”, “G C N”, and “T S G C N”. In panel “(a)” titled “Video Dataset”, the bars show approximate values where Li Conv Former achieves about 0.75 accuracy, 0.77 precision, 0.75 recall, and 0.755 F 1; T C N shows around 0.80 accuracy, 0.81 precision, 0.80 recall, and 0.798 F 1; C N N records about 0.75 accuracy, 0.763 precision, 0.75 recall, and 0.75 F 1; Bi L S T M shows about 0.81 accuracy, 0.817 precision, 0.808 recall, and 0.81 F 1; G C N reaches approximately 0.85 accuracy, 0.86 precision, 0.85 recall, and 0.85 F 1; and T S G C N shows the highest values around 0.875 accuracy, 0.88 precision, 0.875 recall, and 0.875 F 1. In panel “(b)” titled “Vibration Dataset”, the bars indicate Li Conv Former with about 0.775 accuracy, 0.77 precision, 0.775 recall, and 0.77 F 1; T C N with approximately 0.825 accuracy, 0.853 precision, 0.825 recall, and 0.817 F 1; C N N with about 0.82 accuracy, 0.818 precision, 0.82 recall, and 0.818 F 1; Bi L S T M with around 0.925 accuracy, 0.927 precision, 0.925 recall, and 0.925 F 1; G C N with roughly 0.70 accuracy, 0.70 precision, 0.70 recall, and 0.70 F 1; and T S G C N with the highest scores near 0.95 across accuracy, precision, recall, and F 1. Note: All numerical data values are approximated.
Performance comparison of anomaly detection models under single-sensor conditions. (a) Video data performance comparison. (b) Vibration data performance comparison. Source(s): Figure created by authors
The numerical results are shown in Table 4. The proposed model outperforms the other methods on all four metrics for both datasets. Note that the proposed scheme itself contains BiLSTM and GCN components. As shown in Table 4, the standalone BiLSTM and GCN models yield lower detection results than the proposed TSGCN method for both video and vibration data. Because the BiLSTM model analyzes only temporal dependencies, it ignores the spatial structure information contained in the time series data. Conversely, the GCN focuses solely on spatial structure information and cannot capture sequential dependencies between data points. In contrast to traditional schemes, the proposed TSGCN leverages a wider range of information and can selectively prioritize different data sources.
Table 4. Performance comparison of anomaly detection models under single-sensor conditions
| Methods | Video Accuracy | Video Precision | Video Recall | Video F1 | Vibration Accuracy | Vibration Precision | Vibration Recall | Vibration F1 |
|---|---|---|---|---|---|---|---|---|
| LiConvFormer | 75.16% | 76.95% | 75.17% | 75.50% | 77.64% | 77.01% | 77.56% | 76.82% |
| TCN | 80.12% | 81.09% | 80.09% | 79.80% | 82.61% | 85.29% | 82.53% | 81.68% |
| CNN | 75.16% | 76.28% | 75.06% | 75.16% | 81.99% | 81.83% | 81.91% | 81.76% |
| BiLSTM | 80.75% | 81.65% | 80.70% | 81.02% | 92.55% | 92.69% | 92.50% | 92.44% |
| GCN | 85.09% | 85.81% | 85.06% | 85.09% | 70.19% | 70.12% | 70.17% | 70.08% |
| TSGCN | 87.58% | 87.97% | 87.55% | 87.53% | 95.03% | 94.99% | 95.00% | 94.97% |
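The four metrics reported in the comparison tables can be computed from a confusion matrix. The following is a minimal sketch, assuming macro averaging over classes; the paper does not state which averaging scheme was used, and the function name `macro_metrics` and the toy labels are illustrative only.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows: true class, columns: predicted class

    tp = np.diag(cm).astype(float)
    pred_count = cm.sum(axis=0)  # samples predicted as each class
    true_count = cm.sum(axis=1)  # samples that truly belong to each class

    # Per-class scores; classes with an empty denominator score 0.
    precision = np.divide(tp, pred_count, out=np.zeros_like(tp), where=pred_count > 0)
    recall = np.divide(tp, true_count, out=np.zeros_like(tp), where=true_count > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
    }

# Toy labels for three hypothetical door states (not data from the study).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = macro_metrics(y_true, y_pred, n_classes=3)
```

Under macro averaging every class contributes equally to precision, recall, and F1, so these scores can differ from accuracy when classes are imbalanced or unevenly misclassified.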
4.3.3 Comparison with non-temporal modeling approaches
To further validate the effectiveness of the spatial-temporal fusion model, we compared the proposed multi-source spatiotemporal graph method with non-spatiotemporal methods that disregard temporal dependencies or spatial structures. These methods typically rely on statistical feature extraction or simple network structures and encode the input signals as a whole. They often fail to consider the dynamic evolution of time series or the structural relationships between sequences, which makes it difficult for them to capture key features of fault evolution. The comparative methods include a Multilayer Perceptron (MLP), which directly classifies flattened input signals (Rawat et al., 2018); xLSTM, which enhances temporal modeling of long sequences with exponential gating and a modified memory structure (Beck et al., 2024); mixCNN, which extracts richer spatial features through a hybrid convolution design with residual connections (Zhao and Jiao, 2023); and ResCISTA-Net, which extends CISTA by adding residual blocks for better feature extraction (Rao et al., 2024). Table 5 presents their performance comparison.
Table 5. Comparison of the performance of fusion models and non-spatiotemporal methods in anomaly detection
| Methods | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| MLP | 64.60% | 64.65% | 64.70% | 64.08% |
| xLSTM | 77.64% | 78.88% | 77.56% | 77.80% |
| mixCNN | 74.53% | 74.07% | 74.44% | 73.73% |
| ResCISTA-Net | 65.84% | 65.43% | 65.81% | 65.50% |
| TSGCN | 96.27% | 96.30% | 96.28% | 96.28% |
As shown in Table 5, the TSGCN model achieves the best performance on all four metrics, exceeding the strongest baseline, xLSTM (77.64% accuracy), by 18.63 percentage points in accuracy. The MLP model performs worst, as it completely disregards the signal's time-series structure and spatial correlations. While ResCISTA-Net leverages residual blocks to improve low-level feature extraction, it also neglects temporal and spatial structures, which hinders its ability to identify complex fault patterns. xLSTM focuses solely on capturing long-sequence temporal dependencies, without considering the spatial arrangement of data points. Conversely, mixCNN extracts only spatial features and does not model the dynamic evolution of the time series. Neither xLSTM nor mixCNN effectively fuses the spatial-temporal synergy information of time-series signals, which leads to poor extraction of low-resolution anomaly categories.
4.3.4 Comparison with state-of-the-art spatio-temporal models
To provide a comprehensive comparison with existing spatio-temporal modeling approaches, this paper conducts experiments against several representative advanced methods, including MTGNN (Wu et al., 2020), ASTGNN (Guo et al., 2022), STFGNN (Li and Zhu, 2020) and STSGCN (Sofianos et al., 2021). These models are representative recent graph neural network methods for multivariate time series modeling and spatio-temporal feature learning, and they model temporal dependencies and structural correlation information from different perspectives.
Among them, MTGNN adaptively learns the graph structure through a graph learning layer and combines it with temporal convolution for spatio-temporal modeling, making it an advanced model for multivariate time series prediction. ASTGNN introduces an attention mechanism into spatio-temporal graph convolution to dynamically learn the importance weights of different time steps and spatial nodes. STFGNN designs a graph structure that integrates spatio-temporal information and uses parallel GCN modules to extract spatio-temporal features. STSGCN constructs local spatio-temporal graphs and captures local spatio-temporal correlations simultaneously using dedicated convolutional modules. All models used the same data preprocessing procedures and input features as in this study, and the hyperparameters of each model were tuned to achieve its best performance. The comparison results are shown in Table 6.
Table 6. Performance comparison of state-of-the-art spatio-temporal models for anomaly detection
| Methods | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| MTGNN | 77.02% | 78.03% | 77.02% | 76.13% |
| ASTGNN | 78.88% | 79.30% | 78.88% | 78.97% |
| STFGNN | 77.02% | 80.22% | 77.02% | 75.66% |
| STSGCN | 66.46% | 66.39% | 66.46% | 66.13% |
| TSGCN | 96.27% | 96.30% | 96.28% | 96.28% |
As can be seen from Table 6, the proposed TSGCN model outperforms the other four advanced spatio-temporal graph models on all evaluation metrics, exceeding the strongest of them, ASTGNN (78.88% accuracy), by 17.39 percentage points in accuracy. This performance difference indicates that models relying on global or fixed graph structures have limitations in capturing the local and transient abnormal patterns that occur during elevator door operation. By constructing k-nearest neighbor graphs and jointly modeling spatio-temporal dynamic dependencies, TSGCN represents local spatio-temporal patterns more effectively and thereby improves anomaly detection performance.
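To make the k-nearest neighbor graph construction mentioned above concrete, the following is a minimal sketch, not the authors' implementation. It assumes Euclidean distance (the distance metric is a tuned hyperparameter in this study), a symmetrized 0/1 adjacency, and the symmetric normalization with self-loops commonly used in GCNs; the function names and toy feature vectors are hypothetical.

```python
import numpy as np

def knn_adjacency(features, k):
    """Symmetric 0/1 adjacency linking each sample to its k nearest neighbours."""
    n = features.shape[0]
    # Pairwise Euclidean distances (assumed metric for this sketch).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbour

    neighbours = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=float)
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbours.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected

def normalize_adjacency(adj):
    """Add self-loops and apply D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Four toy 2-D feature vectors forming two tight pairs (illustrative only).
x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
adj = knn_adjacency(x, k=1)
a_norm = normalize_adjacency(adj)
```

With `k=1`, each point in this toy example links only to its close partner, so the graph splits into two disconnected pairs. A larger `k` would connect the pairs and mix information between them, which illustrates why the K value affects how local the learned patterns are.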
5. Conclusion
This study proposes a novel multi-source spatial-temporal information fusion model for accurate recognition of elevator door operation states and anomaly detection. Feature engineering techniques are applied in both the temporal and spatial domains to capture the temporal dynamics of sensor signals as well as their latent structural correlations. Using a self-collected dataset of elevator door operations, we conducted experiments to systematically analyze the effects of hyperparameters, including the K value and the distance metric used in graph construction, on model performance. Furthermore, the proposed method was compared with traditional single-sensor models, methods lacking spatiotemporal modeling, and several representative state-of-the-art spatiotemporal graph-based models. The experimental results demonstrate that the proposed multi-source spatial-temporal fusion model outperforms the comparative methods in accuracy and F1 score, validating the effectiveness and advantages of fusing spatial-temporal structures for complex state recognition. In summary, the spatial-temporal graph neural network-based anomaly detection model for elevator door systems developed in this paper exhibits promising performance and application potential. This is achieved through the integration of multi-source information, spatiotemporal dependencies, and graph structure modeling, which enables more effective characterization of localized abnormal patterns than existing spatiotemporal approaches. The model offers both theoretical support and a methodological framework for multi-modal abnormal state recognition in elevator systems. Future research will focus on further optimizing the model architecture, enhancing its ability to identify abnormal states when samples are limited, and incorporating a wider range of sensor data to improve the model's generalization and robustness.

