Figure 1

A diagram of a multimodal signal-processing framework combining vibration and video signals using a TSGCN architecture.

The conceptual workflow diagram shows a multi-source signal analysis system that combines vibration and video signals using a “TSGCN Architecture” (Temporal–Spatial Graph Convolutional Network) for classification. The diagram is organized into four main sections labeled “Multi-source Signal Acquisition”, “TSGCN Architecture”, “Feature Fusion”, and “Result”.

“Multi-source Signal Acquisition”: This section appears on the left side of the diagram and illustrates the input data sources. A smartphone icon indicates a vibration signal acquisition device. Three stacked waveform plots show vibration signals measured along three axes labeled “x-axis”, “y-axis”, and “z-axis”. Each waveform is displayed in a different color: blue for the x-axis, green for the y-axis, and red for the z-axis, representing time-varying vibration amplitudes. Below the vibration signals, a camera icon represents video signal acquisition. A step-like blue waveform labeled “Video Signal” illustrates the temporal representation of video-derived features or motion information extracted from recorded frames. Right-pointing arrows from both vibration and video signals indicate that these data streams are sent to the processing architecture.

“TSGCN Architecture”: The central section illustrates the processing architecture labeled “TSGCN Architecture”. It is divided into two main sections: “Temporal Feature Extraction” at the top and “Spatial Feature Extraction” at the bottom.

“Temporal Feature Extraction”: This section is enclosed within a dashed boundary and further divided into “Local Feature Extraction” and “Global Feature Extraction”.

“Local Feature Extraction”: On the left side, a sequence of vertical feature blocks represents feature maps processed through convolutional operations. Arrows indicate the flow through layers labeled “CONV layer 1”, “Pooling layer 1”, “CONV layer 2”, and “Pooling layer 2”. These layers extract temporal patterns from input signals by progressively reducing dimensionality while preserving important features.

“Global Feature Extraction”: On the right side, a graph-based recurrent structure processes temporal dependencies. Nodes labeled “x subscript 1” and “x subscript t” represent input features at different time steps. These connect to intermediate nodes labeled “h subscript 1” and “h subscript t” through weighted edges labeled “w subscript 1”, “w subscript 2”, “w subscript 3”, “w subscript 4”, “w subscript 5”, and “w subscript 6”. Two pathways labeled “Forward layer” and “Backward layer” indicate bidirectional processing of temporal information. Arrows between nodes show how information flows forward and backward across time steps. Output nodes labeled “y subscript 1” and “y subscript t” represent the extracted temporal features after global processing.

“Spatial Feature Extraction”: This section appears below and focuses on extracting relationships between features using graph-based methods. Multiple stacked blocks labeled “Graph Convolution” illustrate repeated graph convolution operations. Inside each block, a network of interconnected nodes represents a graph structure where nodes exchange information. Each graph convolution block is followed by layers labeled “Batch Norm” and “ReLU”, indicating normalization and nonlinear activation. On the right side of the spatial section, a module labeled “Graph Readout” aggregates node-level features into a single global representation. A graph with connected nodes is shown, followed by a red circular output node. The aggregation method is labeled “global average pooling”, indicating that features from all nodes are averaged to produce the final output.

“Feature Fusion”: The lower-left section shows how features from different modalities are combined. It is divided into two main parts: feature fusion at the top and a neural network classifier at the bottom.

On the upper left side, a dashed box labeled “Fusion of Temporal Features” displays two horizontal rows of circular nodes representing temporal feature vectors. The top row is labeled “Vibrate Features” and contains a sequence of circular nodes representing temporal features extracted from vibration signals. The bottom row is labeled “Video Features” and contains a similar sequence of circular nodes representing temporal features extracted from video data. A plus symbol between the two rows indicates that vibration and video temporal features are combined to produce a fused temporal representation. On the upper right side, another dashed box labeled “Fusion of Spatial Features” shows a similar structure. The upper row represents spatial features derived from vibration signals, while the lower row represents spatial features derived from video signals. Each row contains circular nodes representing feature elements. A plus symbol between the rows indicates that the spatial features from both modalities are fused together.

Below the two fusion blocks, arrows from both the temporal and spatial fusion outputs converge into a label “Concat”, indicating that the fused temporal and spatial features are concatenated into a single combined feature vector. The concatenated feature vector is passed into a neural network classifier illustrated in the lower section. A vertical column of nodes represents the input feature vector. An arrow labeled “ReLU” indicates the application of the Rectified Linear Unit activation function. The features then pass through a fully connected neural network layer illustrated by multiple nodes connected with lines, representing learned weights between layers. On the right side, a vertical column of nodes labeled “Output Classes” represents the final classification results. Each node corresponds to a predicted class category generated by the model.

“Result”: The rightmost section presents the final evaluation results, consisting of two visualizations: a scatter plot of classification outputs and a confusion matrix summarizing model performance.

At the top, a scatter plot displays clustered data points representing different classes predicted by the model. The plot includes a legend labeled “Class” with four categories: “Jamming Fault”, “Door Control Fault”, “Slowdown Fault”, and “Normal”. Each class is represented by a distinct color. The data points form four clearly separated clusters in different regions of the plot, indicating strong class separation. One cluster appears on the left side around negative horizontal values, another cluster appears near the upper center, a third cluster is slightly lower but still near the center-right, and a fourth cluster appears on the lower right side. The separation between clusters suggests that the model effectively distinguishes between different fault conditions and normal operation.

Below the scatter plot, a matrix labeled “Confusion Matrix” presents classification performance in a grid format. The vertical axis is labeled “True”, and the horizontal axis is labeled “Predicted”, with class indices ranging from 0 to 3. The matrix contains four rows and four columns with numerical values indicating prediction counts:
Row 0 (True class 0): 38 correct predictions, with 1 misclassified as class 1 and 1 as class 3.
Row 1 (True class 1): 38 correct predictions, with 2 misclassified as class 0.
Row 2 (True class 2): 40 correct predictions, with no misclassifications.
Row 3 (True class 3): 39 correct predictions, with 2 misclassified as class 1.
The diagonal values are high compared to off-diagonal values, indicating strong overall classification accuracy. Misclassifications are minimal and occur only between a few class pairs.

Note: All numerical data values are approximated.
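As a worked example of how the confusion-matrix counts above translate into summary metrics, the following minimal sketch computes overall accuracy and per-class recall and precision. The counts are the approximate values read from the figure, the function names are illustrative, and the mapping from class indices 0 to 3 to the named fault classes is not stated in the figure, so none is assumed here. Note that row 3 as described sums to 41 while the other rows sum to 40, which is consistent with the values being approximations.

```python
# Confusion-matrix counts as given in the figure description (approximate values).
# Rows are true classes 0-3, columns are predicted classes 0-3.
confusion = [
    [38, 1, 0, 1],
    [2, 38, 0, 0],
    [0, 0, 40, 0],
    [0, 2, 0, 39],
]


def overall_accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """Diagonal count divided by the row (true-class) total."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def per_class_precision(matrix):
    """Diagonal count divided by the column (predicted-class) total."""
    n = len(matrix)
    column_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [matrix[c][c] / column_totals[c] for c in range(n)]


accuracy = overall_accuracy(confusion)      # 155 / 161, about 0.963
recall = per_class_recall(confusion)        # [0.95, 0.95, 1.0, 39/41]
precision = per_class_precision(confusion)  # [0.95, 38/41, 1.0, 0.975]
```

With these approximate counts, 155 of 161 samples fall on the diagonal, giving an overall accuracy of about 96.3%.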

Convolutional network fusion model for spatial-temporal maps. Source(s): Figure created by authors
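The spatial branch, fusion step, and classifier shown in the figure can be illustrated with a small NumPy sketch. This is not the authors' implementation. The symmetric adjacency normalization, the reading of the plus symbol as element-wise addition, the weights shared across modalities, the layer sizes, the toy graph, and all function names are assumptions made for illustration. The temporal feature vectors are random stand-ins for the outputs of the convolutional and bidirectional recurrent branch.

```python
import numpy as np


def normalized_adjacency(adj):
    """Add self-loops and apply symmetric degree normalization (an assumed, common choice)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv_block(a_norm, x, weight, eps=1e-5):
    """Graph convolution, then batch norm without learned scale/shift, then ReLU."""
    h = a_norm @ x @ weight
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    return np.maximum(h, 0.0)


def graph_readout(h):
    """Global average pooling: average node features into one vector."""
    return h.mean(axis=0)


def fuse(t_vib, t_vid, s_vib, s_vid):
    """Add the two modalities per feature type (assumed), then concatenate."""
    return np.concatenate([t_vib + t_vid, s_vib + s_vid])


def classify(fused, weight, bias):
    """ReLU followed by one fully connected layer producing class scores."""
    return np.maximum(fused, 0.0) @ weight + bias


rng = np.random.default_rng(0)

# Hypothetical 4-node chain graph with 3 input features per node.
adj = np.array(
    [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float
)
a_norm = normalized_adjacency(adj)
w_gc = rng.normal(size=(3, 8))  # shared across modalities for brevity (assumption)

spatial = {}
for name in ("vibration", "video"):
    node_features = rng.normal(size=(4, 3))
    spatial[name] = graph_readout(graph_conv_block(a_norm, node_features, w_gc))

# Stand-ins for the temporal branch outputs of each modality.
temporal = {name: rng.normal(size=8) for name in ("vibration", "video")}

fused = fuse(
    temporal["vibration"], temporal["video"], spatial["vibration"], spatial["video"]
)
logits = classify(fused, rng.normal(size=(16, 4)), np.zeros(4))
predicted_class = int(np.argmax(logits))  # index into the four output classes
```

The sketch uses a single graph convolution block per modality; the figure shows several stacked blocks, which would correspond to calling `graph_conv_block` repeatedly with separate weight matrices.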
