The purpose of the research was to implement a proper algorithm and identify which one would be the most efficient in data analysis for railway infrastructure monitoring.
A couple of algorithms were applied and tested based on the data gathered by the UAV along the railway tracks. The standard methods for assessment were used, and finally, the app was made to show the defined elements of the railway on the map using geolocation.
The YOLO algorithm (version 8) achieved the best results and was ultimately applied in the system.
The algorithm may be used for a wider purpose if proper training methods and tagging of photos and key elements are applied. If the UAV delivers suitable photos (taken at the proper speed and distance), the YOLO algorithm is an efficient method for identifying track elements.
The railway tracks are currently monitored manually by technicians who have to walk along the tracks. Not only is this not a safe solution, but the reliability and efficiency of this approach are poor. Implementing UAV and AI algorithms for elements identification and preventing the track's faults is a promising solution. The main company PKP wants to implement the solution and scale it up for longer distances.
Although the AI methods exist, the biggest challenge is to fit them and tag the photos properly so that the process is done efficiently. Applying UAV and AI together to increase the autonomy level is an original approach.
1. Introduction
Recent years have seen rapid technological development and increased accessibility, especially in unmanned systems (Jeffrey et al., 2023; Ge and Sadhu, 2025). The miniaturization of hardware and constant improvements in system efficiency enable the application of UAVs in areas previously inaccessible (Spencer et al., 2019). One of the best examples is using UAVs and monitoring systems to inspect railway tracks, identify infrastructure, and provide information about its condition. Currently, railway inspection is often performed manually by technical workers who walk or drive along the tracks and photograph the infrastructure (Golebiowski and Kukulski, 2020; Burdziakowski et al., 2024). That is inefficient and, most importantly, not a safe way to obtain proper documentation. Due to the lack of repeatability and the inability to maintain the same high inspection quality, the documentation is often not done well enough (Norén-Cosgriff et al., 2020), although it is imperative to ensure safety and high operational performance. Following the inspection, the assessment of the quality of various elements and the formulation of the report are also performed by hand. Moreover, the data are subjective and depend on the technician's experience. The entire process is time-consuming and costly for the company, as the employee must be paid for several hours or days of work (Wiseman, 2024). A dedicated program is under development at Warsaw University of Technology, Warsaw, Poland: “An Innovative Autonomous System for Monitoring the Railway Infrastructure with the use of Artificial Intelligence and Unmanned Aerial Vehicles”. Within the project, a dedicated automated identification system is being created. The Polish national railway company (PKP PLK) seeks a solution to make the inspection process safer and more automatic, using UAVs, Artificial Intelligence, and machine learning to inspect railway infrastructure.
Significant scale problems due to infrastructure faults could be prevented before the error becomes critical, e.g. isolator cracks (see Figure 1).
The project is divided into three main steps: first, developing the proper methodology for taking pictures and finding the optimum settings for the UAV and the payload so that the data collected are good enough for the AI algorithms; secondly, developing an automated identification system for the most essential elements of the track; finally, constructing a docking station so that the whole operation can be automated with a significant reduction in personnel (Pineda-Jaramillo et al., 2020). The methodology for proper data acquisition is being implemented within the project, and specialized machine learning algorithms are being applied to enable the automated system to identify infrastructure faults and potential damage. The system's end user selected most of the damage types to be identified, which concern key elements of the infrastructure (isolators, material discontinuities, and missing elements).
The research in the project is carried out jointly by Warsaw University of Technology (Faculty of Aeronautical Engineering), a leading technical university in Poland; SkySnap, one of the most innovative geotech companies; and Enprom, the biggest company performing UAV inspections of power lines in Poland and Europe. This article explores how the inspection process could be semi-automated, decreasing costs and time for the company. In addition, because taking pictures from the air takes considerably less time than an inspection on foot, the process could be conducted more often, contributing to an overall increase in safety. Existing algorithms are widely used in industry; however, each problem requires selecting the most efficient algorithm to achieve satisfactory results. In this research, two methods have been compared.
2. Literature review
2.1 Theoretical background for computer vision
Vision, both humans' and computers', consists of two major components. The first one is the sensing device. It captures as much information as possible from the environment. People use their eyes to capture light. This information is then passed to the brain using neurons. Cameras capture light and transfer it to the computer through pixels. Machines are far superior at this stage, as modern technology allows for capturing more detailed images from greater distances and at more frequencies (e.g. Infrared). The second one is the interpreting device. This is where the information from the picture is processed, and its meaning is determined. People can do this on multiple different levels in a matter of seconds. Computers are still lagging in this matter. The computer vision (CV) problem, despite significant progress in recent years, remains unsolved. For example, when dealing with a CV problem, each picture can be viewed as a sample consisting of discrete pixels and exemplary features. Depending on the image type, they are:
Either black (0) or white (1) for binary images.
A representation of a grey scale where they range between black (0) and white (255) for instances of grey-scale images.
Multiple color channels for standard images. In the most common RGB color space, each pixel is represented as a 1 × 3 array, where each value ranges from 0 to 255 and corresponds to the intensity of Red, Green, and Blue, respectively.
From the computer's perspective, the images are matrices. Several concepts should first be introduced to understand how “intelligent” connections are made from such a form, as they are universally used in various CV problems.
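As a minimal illustration of the three image types listed above, the sketch below represents tiny 2 × 2 images as nested Python lists. The pixel values and the helper `shape` are illustrative assumptions, not data or code from the project.

```python
# Binary image: each pixel is 0 (black) or 1 (white)
binary_image = [
    [0, 1],
    [1, 0],
]

# Grey-scale image: each pixel is an intensity from 0 (black) to 255 (white)
grey_image = [
    [0, 128],
    [200, 255],
]

# RGB image: each pixel is a triple (red, green, blue), each value in 0..255
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def shape(image):
    """Return (rows, columns, channels) of a nested-list image (hypothetical helper)."""
    first_pixel = image[0][0]
    channels = len(first_pixel) if isinstance(first_pixel, tuple) else 1
    return len(image), len(image[0]), channels


print(shape(binary_image), shape(grey_image), shape(rgb_image))
```

In practice, such matrices are usually held in array libraries rather than nested lists; the structure, however, is the same.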
2.2 Artificial neural networks
An Artificial Neural Network (ANN) is an approach inspired by the human brain. Multiple neurons are created, and information is passed between them. A single neuron, or perceptron, can take several inputs and produce an output based on their weighted sum. The basic workflow is to multiply each input by its associated weight. Then all of those products are summed, and the result is passed through an activation function that determines the output. Weights are determined during the training phase and, most of the time, are generated automatically. Activation functions (Jadon, n.d.) are used to introduce nonlinearity to the neural network system. They decide whether to use the perceptron output and allow the networks to learn more complex functions, which are common in image processing. Most are continuous and differentiable functions (see Figure 2). An ANN is a collection of multiple neurons and activation functions. The layers of perceptrons between the input and output layers are called hidden layers; together with the input and output layers, they form the entire model (see Figure 3). Most of the time, multiple layers of neurons are used, with the outputs from one layer serving as inputs to the subsequent one. The number of layers is one of the hyperparameters used to tune and optimize the model. This approach is often called a black-box method, as the algorithm automatically generates all the hidden layers, which are next to impossible for a human to interpret. However, building relations between pixels in an image is an abstract concept, so using a white-box method is not an option. Training begins with initializing random values for all the weights in the system. Then, computations are made to determine the outcome, which is compared to the ground truth. The difference is the system's error (Mapundu et al., 2020). Based on this, the weights are tuned, and the process is repeated until the error cannot be reduced further.
According to the universal approximation theorem, such a system with enough neurons can approximate any continuous function to arbitrary accuracy (see Figure 3).
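The weighted-sum-and-activation workflow described above can be sketched as follows. This is a minimal illustration: the sigmoid activation, the bias term, and the names `perceptron` and `dense_layer` are assumptions chosen for the example, not the configuration used in the project.

```python
import math


def sigmoid(z):
    """A common continuous, differentiable activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(inputs, weights, bias=0.0, activation=sigmoid):
    """Multiply each input by its weight, sum the products, apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


def dense_layer(inputs, weight_rows, biases):
    """One layer: every perceptron receives all inputs from the previous layer."""
    return [perceptron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Two inputs feeding a layer of two perceptrons
hidden = dense_layer([1.0, 2.0], [[0.5, -0.25], [0.1, 0.3]], [0.0, 0.0])
print(hidden)
```

Stacking several such layers, with the outputs of one used as the inputs of the next, gives the multi-layer structure described in the text.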
In classification problems, the labels are strings. To allow mathematical operations, they are converted to vectors. The principle is that a target variable is 1 for the class that is present and 0 elsewhere. For example, in binary classification where a picture contains a cat or a dog, the targets are [1,0] and [0,1], respectively. This is called one-hot encoding. The SoftMax function can be used for multiclass classification problems. It exponentiates each output of the system and divides it by the sum of the exponentials of all outputs, ensuring that the ANN outputs always sum to 1. This way, the model returns a probability distribution over the possible outcomes.
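A short sketch of one-hot encoding and SoftMax is given below. The class names and raw scores are illustrative assumptions; subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math


def one_hot(label, classes):
    """Return a vector with 1 at the position of `label` and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]


def softmax(scores):
    """Turn raw network outputs into probabilities that sum to 1."""
    shift = max(scores)  # stability: avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "dog"]
print(one_hot("cat", classes))   # [1, 0]
print(one_hot("dog", classes))   # [0, 1]

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))
```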
2.3 Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a type of ANN specialized in image processing (Author 3, n.d.). Due to the large sizes of digital images, the standard approach would be too computationally intensive, as the large number of pixels produces an overwhelming number of connections and weights (in an ANN, each neuron is connected to each perceptron in the subsequent layer) (Pranati et al., 2024). A CNN solves this problem by significantly reducing the input size and the number of unique connections before performing calculations. The standard architecture consists of three main parts: a convolutional layer, a pooling layer, and a fully connected layer (see Figure 4).
The first one is the most essential part of the solution. It calculates a dot product between two matrices: the kernel and an isolated part of the input image (see Figure 5). This operation results in an activation map, a smaller but more detailed table. During the procedure, the kernel is moved across the entire height and width of the input. After each step, it is displaced by a predefined number of columns/rows. This value is called the stride. For example, if the input is 4 × 4 and the kernel size is 2 × 2 with a stride of 1, the output will be 3 × 3.
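The sliding-kernel operation and the 4 × 4 example above can be checked with a small sketch. It assumes a square kernel and no padding; the function names are illustrative.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Output length along one dimension for a convolution without padding."""
    return (input_size - kernel_size) // stride + 1


def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over the image and take the dot product at each position."""
    k = len(kernel)
    out_rows = conv_output_size(len(image), k, stride)
    out_cols = conv_output_size(len(image[0]), k, stride)
    activation_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride
            row.append(sum(image[top + i][left + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        activation_map.append(row)
    return activation_map


image = [[1] * 4 for _ in range(4)]   # 4 x 4 input
kernel = [[1, 1], [1, 1]]             # 2 x 2 kernel
result = convolve2d(image, kernel, stride=1)
print(len(result), len(result[0]))    # 3 3
```

With the same input and kernel, a stride of 2 would yield a 2 × 2 activation map.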
A pooling layer is another way of reducing the representation size, used in various places in the CNN algorithm (Pranati et al., 2024). It reduces the number of dimensions by applying a specific statistical operation to each slice of an image representation, one at a time. While there are numerous pooling functions, Max Pooling is the most popular and commonly used (see Figure 6). It returns the maximum value within the given neighborhood, significantly simplifying subsequent calculations while preserving important information.
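Max Pooling as described above can be sketched in a few lines. The 2 × 2 window with a stride of 2 is a common choice assumed here for illustration; the feature map values are made up.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Return the maximum value within each size x size neighborhood."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
]
print(max_pool2d(feature_map))  # [[6, 4], [7, 9]]
```

The 4 × 4 map is reduced to 2 × 2 while the strongest response in each neighborhood is preserved.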
A fully connected (FC) layer is similar to a classical neural network. Here, all neurons are connected to those in the preceding and subsequent layers. This allows computation via matrix multiplication followed by the addition of a bias. The FC layer translates the input representation into the output.
2.4 Types of computer vision problems
All the methods and terms described before this section are universally applicable when facing a CV problem. However, it is worth noting that this scientific branch has many subcategories, each with a different end goal and thus slightly varying approaches (see Figure 7). Most papers have deep learning at their core (Shanmugamani, 2018).
Difference between image classification, detection, and segmentation
Among CV problems, image classification plays a prominent role and is crucial in modern technology (Hongruixuan et al., 2024). The algorithm assigns a label or tag to a given picture that contains an object or concept. The training data consists of numerous already-tagged images. Gathering an appropriate set of inputs is usually much more challenging than writing the code itself. This is why programmers tend to use open-source databases when possible. In other cases, the process is exceptionally time-consuming and sometimes expensive, as an expert must adequately label each picture. Depending on the desired output, image classification can be divided into four types:
Binary: Binary classification always has two possible outcomes. When the task involves saying whether there is a cat or a dog in the picture or tagging it as either a person or not a person (Yes/No output), the problem will use a binary solution.
Multiclass: This is very similar to the binary problem; however, there are more than two possible classes. The model calculates the probability that the image belongs to each class, then returns the class with the highest probability. A handwriting recognizer, which identifies the most probable letters, is a prime example of a multiclass problem.
Multilabel: As the name suggests, this solution assigns more than one label to a single input. The process is a little more complex than the previous two classifiers. An example where this proves helpful is automatic tagging of content that may be about science, travel, and technology at the same time.
Hierarchical: This process involves organizing classes into a specific structure. Usually, the labeling process is done twice, once on a broader category and a second time on a narrower subset of the domain determined in a first iteration.
The second most encountered problem in CV is object detection. In its nature, it is very similar to image classification. It is trained on a large, labeled dataset and recognizes given objects. The key difference is that, in addition to tagging the entire image, the algorithm identifies the target object's position and bounds it with a box. Input preparation is even more challenging and time-consuming than in the previous problem. This technology is widely used in autonomous vehicles, for example, to detect pedestrians around them. Image segmentation is closely related to object detection. It is essentially a classification problem, but performed for each pixel. This gives exceptional entity separation within a picture, making this slightly more complex to evaluate than previous processes. This approach plays a crucial role in satellite imagery (Vohra et al., 2023).
2.5 You only look once
The YOLO method was introduced by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi (Redmon et al., 2016). Despite being relatively new, this technology has taken over the CV world and is now one of the most used approaches. YOLO has been made open source, significantly increasing its development pace (Jingxin et al., 2025). At its core, this method is relatively simple. It treats object detection as a regression problem rather than as classification (Wang et al., 2024). The first step YOLO takes is to divide the input into a grid of equal-sized cells (Zhu et al., 2024). Then, CNNs are used to estimate the probability that each cell contains the target object. Cells with zero scores are removed from consideration at this step. Following that, the regression module is used to predict bounding boxes (Su et al., 2024; Goyal et al., 2022). A very common case at this point is that multiple candidate boxes exist for a single object. This is where IoU (Intersection over Union) comes in. It represents the ratio of the intersection area (the overlap between the predicted bounding box and the ground truth box) to the union area of the two boxes. A value of 0 indicates no overlap, while 1 represents a perfect match. Usually, a threshold is defined up front, for example, 0.5, and once it is exceeded, the bounding box is assigned to the object.
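The IoU ratio described above can be computed as in the following sketch. Boxes are assumed to be given as `(xmin, ymin, xmax, ymax)` tuples in pixels; this representation and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
score = iou(predicted, ground_truth)   # overlap 1, union 7
print(score, score > 0.5)
```

In this example the IoU is 1/7, below a threshold of 0.5, so the predicted box would not be assigned to the object.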
Advantages:
Fast, efficient, and provides good generalization
Capable of detecting multiple objects in a single picture
Drawbacks:
Struggles with small object detection
Slightly less precise than region-based models
2.6 R-CNN
Region-based CNN (R-CNN) was developed by Ross Girshick in 2014 as a new approach to object detection. The algorithm's core uses the so-called selective search instead of the exhaustive search used until then. It takes advantage of object segmentation. At first, numerous region proposals are generated (around 2000). Later, CNNs are applied to each region individually to estimate feature representations. This outputs a feature vector for each segment, which is later passed to Support Vector Machines (SVMs). The proposed boxes are then classified, and regression is performed on the bounding boxes to adjust their dimensions to better fit the target object. While accurate, this solution is highly time-consuming and requires extensive storage space. To address those downsides, Fast R-CNN and Faster R-CNN have been introduced. The key difference is that the convolutional feature map is generated during the first step. Following this, the maps are passed to the region proposal network, which produces anchors rather than 2000 individual regions. Those are bounding box centers, each with several boxes attached, differing in aspect ratio and size. They are later passed through a convolutional layer and a regression layer, respectively.
Advantages:
High accuracy, even for small objects
Exceptional performance on complex images with overlapping targets
Drawbacks:
Relatively slow (even Faster R-CNN)
Requires extensive training data to work properly
2.7 SSD
Single Shot MultiBox Detector (SSD) is one of the fastest algorithms available today and excels at real-time operation. Its speed stems from combining classification and localization in a single pass. As its backbone, SSD uses a CNN, usually pre-trained, as a feature map extractor (Singh, 2023). Multiple feature maps are generated at progressively lower resolutions. The first ones have higher resolutions and capture small objects, while the last ones capture large ones. Each map is divided into cells that contain predefined bounding boxes with various aspect ratios and shapes. Those are then compared with ground-truth boxes to calculate IoU scores. SSD calculates class confidences and box offsets in a single step. Those describe the probability scores for each class, including the background class (no detectable object), and the required adjustment vectors for each cell, respectively (ChihShen et al., 2024). As the algorithm generates many overlapping boxes, Non-Maximum Suppression is used to reduce redundancy. It keeps only the bounding box with the maximum confidence score for each object (Lekidis et al., 2022). Hard negative mining is also used to improve the training process. As background usually covers a disproportionately large portion of a picture, only the hardest negative examples (those with the highest loss) are considered, to ensure a proper learning balance.
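Non-Maximum Suppression as described above can be sketched with a simple greedy procedure. This is a minimal single-class illustration: the `(xmin, ymin, xmax, ymax)` box format, the 0.5 threshold, and the example boxes are assumptions, and production implementations are typically vectorized and applied per class.

```python
def iou(a, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after greedy suppression of overlapping ones."""
    # Visit boxes from the highest to the lowest confidence score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for idx in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[idx], boxes[k]) <= iou_threshold for k in kept):
            kept.append(idx)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68 and is therefore suppressed, while the distant third box is kept.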
Advantages:
Computational speed, allowing for real-time application
Satisfactory performance for all object sizes
Drawbacks:
Accuracy can be lacking compared to two-stage detectors
Complex training process
2.8 WSOD
Weakly Supervised Object Detection (WSOD) is an interesting approach to object detection when labeled training data is unavailable and would be too time-consuming to create. Instead of relying on tagged bounding boxes, it uses image-level labels to indicate which objects are present in the picture, without specifying their locations. Most of the time, WSOD uses Multiple Instance Learning (MIL) methods to predict regions that may contain the target. Then, the model determines which boxes are most likely to match the position of labeled objects by maximizing the probability that boxes containing objects are correctly interpreted as such. During training, WSOD outputs pseudo-labels by assigning predicted bounding boxes to objects based on the highest confidence. The learning process is iterative, with the algorithm repeatedly re-evaluating and adjusting its predictions. It is common practice to include refinement branches and additional training stages to further increase accuracy.
Advantages:
Requirement for image labels only. No bounding boxes needed
Very stable for changes in dataset sizes
Drawbacks:
Low precision
Struggles when dealing with complex pictures
2.9 Algorithm selection
Real-time computation speed is unnecessary for report generation and can be neglected as a selection criterion. This would lead to the YOLO and Faster R-CNN algorithms; however, the latter is replaced with SSD due to hardware limitations. The resulting accuracy drop is not significant, and SSD, similarly to YOLO, is a single-stage detector that performs its calculations in one pass. The training process for both methods is comparable as well.
3. Methodology
3.1 Data acquisition
Data were gathered using one of the most common systems, the DJI M350 (see Figure 8), an off-the-shelf UAV carrying a payload (Author 4, n.d.). The payload was a Phase One iXM-100 camera with a resolution of 100 MPx, a low-weight platform for small or UAV-based integrations. The iXM-100 is a medium-format sensor featuring backside-illuminated (BSI) technology, which enhances light sensitivity and dynamic range. It is a high-productivity metric camera with a range of specifically designed RSM lenses, available in focal lengths from 35 [mm] to 300 [mm]. With a properly chosen distance and UAV speed, the sensor can be used efficiently for railway inspection. The data were taken from a 25 [km] section of railway in Poland, using the previously developed methodology in which the optimal distance and camera settings were determined (see example image in Figure 9). The UAV flew in accordance with the railway company's regulations. The minimum height was set at 20 [m] above ground, except where trees taller than 20 [m] stood close to the tracks, in which case the UAV had to climb to approximately 2 [m] above the trees. In addition to the height restriction, the UAV was not allowed to fly directly over the tracks, so its path was shifted 10 [m] to the side. The camera angle and UAV speed settings were determined over subsequent rounds of iteration: after each data set, the team responsible for data identification provided feedback to the UAV team to ensure optimal data collection (Perry et al., 2020; Ejaz and Choudhury, 2024). In total, 754 images were considered in the analysis. The original data batches contained more pictures, but those mostly showed catenary poles, for which accuracy is already satisfactory, or were negatives.
3.2 Dataset and labeling
Data from the UAV flights is raw, meaning it consists only of pictures, none of which have been selected as relevant, and no tags have been assigned (Kopyt and Rodo, 2024). Much work needs to be done before implementing the algorithm, chiefly, though not exclusively, defining ground-truth bounding boxes. First, the target classes must be determined. After a general assessment of the images, the following classes were chosen for evaluation:
Catenary pole: the structure used to support the railway traction lines, which provide power to the trains. One of the most crucial elements of infrastructure
Semaphore: a system of lights used to coordinate railway traffic. Most of the time, it consists of several vertically stacked light signals
Warning shield: it is used to provide information to the driver about the incoming semaphore signal, so that he can adjust velocity accordingly to be able to perform all possible maneuvers
Road lights: a light system used to warn land traffic drivers about the intersection of the road and the tracks
Road barrier: used to block land vehicles from crossing the tracks for safety reasons, to avoid possible collisions
Weights: they are mounted on selected catenary poles to provide the required tension to the traction lines.
To properly define the bounding boxes, which will later be stored in an appropriate format, a labeling tool has to be chosen (Amrani et al., 2020). For this study, Microsoft VoTT was selected because it does not require sending the files to the cloud and its output can be converted to formats compatible with YOLO and SSD. The process consisted of drawing bounding boxes and assigning appropriate tags to each. Additionally, some images with no classes have been included in the training to help reduce false positives (see Figures 10 and 11). The process for the whole system identification is as follows: first, the proper data must be gathered. Once the data set is complete, the images are labeled. After this process, data quality needs to be checked. If the labels are correct, the annotations are normalized, and finally, the algorithm is trained.
All images from the first batch were assigned proper tags during labeling. However, there is a significant disproportion in the data between the catenary pole and the remaining five classes, which can severely affect the model's accuracy. For this reason, in the remaining batches, images containing only the first class were not included, in order to build a more balanced training set. Pictures with interesting negatives, like poles in shaded regions or similar structures, were still considered. With this criterion, 754 images from the original batches were labeled and are used in the later stages of the paper. This reduction, apart from increasing the model's accuracy, also reduces computation time. VoTT stores the drawn bounding boxes in JSON files containing the data required by the algorithms, with one file per image. This format is supported by neither YOLO nor SSD; however, it can be converted to the formats they require. This process differs between the two methods and is described in the following subchapters. Key parameters included in the JSON files are:
Corresponding image path
Classes present in the picture
Bounding box coordinates
Bounding box width and height
3.3 Data preparation for YOLO
The YOLO method uses vectors to describe bounding boxes, one per line of the label file. All values must be relative to the image dimensions (in the range 0 to 1) and are supposed to be in the following form:
<class_id> <x_center> <y_center> <width> <height>
A Python script that converts the original JSON files to YOLO label files with the appropriate vectors in .txt format has been created. Below is a code snippet with the function that converts the bounding box information.
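def convert_to_yolo_format(img_width, img_height, xmin, ymin, xmax, ymax):
    # Box center, normalized by the image dimensions
    x_center = ((xmin + xmax) / 2) / img_width
    y_center = ((ymin + ymax) / 2) / img_height
    # Box size, normalized by the image dimensions
    box_width = (xmax - xmin) / img_width
    box_height = (ymax - ymin) / img_height
    return x_center, y_center, box_width, box_height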
Additionally, VoTT lists classes as strings, while YOLO uses a numerical class representation. For this reason, a dictionary for class mapping has been defined.
class_dict = {
    'catenary_pole': 0,
    'semaphore': 1,
    'warning_shield': 2,
    'road_lights': 3,
    'road_barrier': 4,
    'weights': 5,
}
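To show how the coordinate conversion and the class mapping fit together, the sketch below builds one line of a YOLO label file. The helper name `to_yolo_line`, the six-decimal formatting, and the example image size and box coordinates are assumptions for illustration only.

```python
CLASS_IDS = {
    'catenary_pole': 0,
    'semaphore': 1,
    'warning_shield': 2,
    'road_lights': 3,
    'road_barrier': 4,
    'weights': 5,
}


def to_yolo_line(class_name, img_width, img_height, xmin, ymin, xmax, ymax):
    """Return '<class_id> <x_center> <y_center> <width> <height>' with relative values."""
    x_center = ((xmin + xmax) / 2) / img_width
    y_center = ((ymin + ymax) / 2) / img_height
    box_width = (xmax - xmin) / img_width
    box_height = (ymax - ymin) / img_height
    values = (x_center, y_center, box_width, box_height)
    return f"{CLASS_IDS[class_name]} " + " ".join(f"{v:.6f}" for v in values)


# Hypothetical 1000 x 500 image with a semaphore box from (100, 50) to (300, 150)
line = to_yolo_line('semaphore', 1000, 500, 100, 50, 300, 150)
print(line)  # 1 0.200000 0.200000 0.200000 0.200000
```

One such line is written per bounding box, and all lines for an image go into a single .txt file with the same base name as the image.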
For many CV algorithms, YOLO included, three data sets are recommended:
Training set: usually 70% of the original. It is passed to the algorithm in the learning phase, when building neural networks
Validation set: 15% of the original data. The algorithm has access to it during the training phase, but does not use it to learn. It evaluates the model's accuracy, tunes hyperparameters, and detects potential overfitting.
Test set: 15% of the original data. The algorithm has access to it only after the model is ready. It uses a testing set to investigate performance on never-seen data
For standard sets of pictures, a random 70-15-15 split is applied. It is worth noting that in object detection problems, both label files and images must be assigned to the same folders, as algorithms scan folders with the same names for files with the same names. As mentioned in the previous subchapter, the class proportion is skewed. Even after filtering out images with only catenary poles, this class still heavily outweighs the others. In such a case, the standard approach to splitting can result in one class not being well represented across all data splits (for example, a semaphore could have 50 occurrences in the training set but only 1 in the validation and test sets). To avoid that, a special approach is introduced. The dictionary is created, with keys being class names and values being all the files where the class occurs. Then, a 70-15-15 random split is performed for each pair, and the resulting splits are appended to global train, val, and test lists. Finally, all of the duplicate file names are removed. This ensures that each class is represented in all sets. The resulting data splits have the following class representation:
Class breakdown for training set:
'catenary_pole': 566
'weights': 82
'road_lights': 59
'semaphore': 52
'road_barrier': 38
'warning_shield': 35
----------------------------------
Class breakdown for validation set:
'catenary_pole': 183
'weights': 30
'road_lights': 29
'semaphore': 22
'road_barrier': 20
'warning_shield': 14
----------------------------------
Class breakdown for testing set:
'catenary_pole': 220
'weights': 63
'road_lights': 50
'semaphore': 47
'road_barrier': 31
'warning_shield': 23
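The per-class splitting procedure described above can be sketched as follows. The function name, the fixed random seed, and the input structure (a mapping from file name to the classes it contains) are assumptions. The text does not state how a file placed in different splits by different classes is resolved; this sketch keeps it in the first split in the order train, val, test, which is also an assumption.

```python
import random
from collections import defaultdict


def per_class_split(file_classes, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split files 70-15-15 separately for each class, then remove duplicates."""
    rng = random.Random(seed)
    # Dictionary: class name -> all files in which the class occurs
    files_by_class = defaultdict(list)
    for file_name, classes in file_classes.items():
        for cls in set(classes):
            files_by_class[cls].append(file_name)

    train, val, test = [], [], []
    for cls in sorted(files_by_class):
        files = sorted(files_by_class[cls])
        rng.shuffle(files)
        n_train = int(len(files) * ratios[0])
        n_val = int(len(files) * ratios[1])
        train += files[:n_train]
        val += files[n_train:n_train + n_val]
        test += files[n_train + n_val:]

    # Remove duplicate file names; a file stays in the first split it reached
    seen, splits = set(), {}
    for name, files in (("train", train), ("val", val), ("test", test)):
        splits[name] = [f for f in dict.fromkeys(files) if f not in seen]
        seen.update(splits[name])
    return splits


# Hypothetical data set: 20 images, every second one also shows a semaphore
file_classes = {f"img_{i}.jpg": ["catenary_pole"] for i in range(20)}
for i in range(0, 20, 2):
    file_classes[f"img_{i}.jpg"].append("semaphore")
splits = per_class_split(file_classes)
print({name: len(files) for name, files in splits.items()})
```

Because duplicates are resolved in favor of earlier splits, the final proportions can deviate from exactly 70-15-15, which is consistent with the uneven class counts reported above.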
YOLO code is optimized to work on images at resolutions of 640 × 640 or 1,280 × 1,280. The original data comes in 16k resolution. Using the images without rescaling drastically increases computation time without significantly improving model accuracy. For this reason, all the pictures have been downsized to an image size of 1,280. The final preparation step is to create a .yaml file, which the model uses to access the appropriate data splits and store class names.
train: "D:/labelled_data/YOLO/images/train"
val: "D:/labelled_data/YOLO/images/val"
test: "D:/labelled_data/YOLO/images/test"
nc: 6
names: ['catenary_pole', 'semaphore', 'weights',
        'warning_shield', 'road_lights', 'road_barrier']
The paths to images are specified, while those to labels are not. The code automatically looks for the latter with the appropriate folder structure. The leading directory should include the YAML file and the folders images and labels. Each of those should have train, val, and test folders.
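The folder convention described above implies that a label path can be derived from an image path by swapping the `images` folder for `labels` and the image extension for `.txt`. The sketch below illustrates that mapping; the helper name and the example file name `img_001.jpg` are hypothetical, and the exact lookup performed inside the YOLO code is assumed rather than reproduced.

```python
from pathlib import PurePosixPath


def label_path_for(image_path):
    """Map .../images/<split>/<name>.<ext> to .../labels/<split>/<name>.txt."""
    parts = list(PurePosixPath(image_path).parts)
    # Replace the last 'images' folder in the path with 'labels'
    index = len(parts) - 1 - parts[::-1].index("images")
    parts[index] = "labels"
    return str(PurePosixPath(*parts).with_suffix(".txt"))


print(label_path_for("D:/labelled_data/YOLO/images/train/img_001.jpg"))
# D:/labelled_data/YOLO/labels/train/img_001.txt
```

A check of this kind can be run over all images before training to confirm that every picture has a matching label file in the expected place.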
3.4 Data preparation for SSD
The SSD algorithm uses the XML file format to store annotation data. Instead of center coordinates, height, and width, it stores the bounding box's minimum and maximum x and y coordinates. In this format, the y-values are zero at the top of the picture and increase towards the bottom. The file also stores the image depth, which is 3 for a standard RGB picture (1 for each color channel). The VoTT JSON output is converted to the desired XML format using simple code and the appropriate Python library. It uses a library-defined ET class to store bounding-box data and a write function to output XML files. A for loop iterates over all bounding boxes in the image, while width, height, and depth correspond to the picture's global dimensions. Since the SSD approach is primarily used for real-time processing, the input image size is significantly smaller than in YOLO, at 300 × 300. The images are resized to this resolution because the model performs best at it. As with YOLO, a random split is not preferred, as it may lead to unsatisfactory class representations. To avoid this, a similar per-class splitting technique is used. The final class breakdown is as follows:
Class breakdown for training set:
'catenary_pole': 676
'weights': 107
'road_lights': 111
'semaphore': 68
'road_barrier': 75
'warning_shield': 57
----------------------------------
Class breakdown for validation set:
'catenary_pole': 150
'weights': 25
'road_lights': 17
'semaphore': 24
'road_barrier': 9
'warning_shield': 13
----------------------------------
Class breakdown for testing set:
'catenary_pole': 143
'weights': 26
'road_lights': 24
'semaphore': 11
'road_barrier': 18
'warning_shield': 10
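The conversion to the XML annotation format described in this section could look roughly as follows. This is only an illustrative sketch, not the authors' code: it assumes the standard-library `xml.etree.ElementTree` module (imported as `ET`), a Pascal VOC-style tag layout, and a simplified input (a plain list of boxes) in place of the actual VoTT JSON structure, which is not shown in the text. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def center_to_corners(cx, cy, w, h):
    """Convert a center/width/height box to (xmin, ymin, xmax, ymax)."""
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def voc_annotation(filename, width, height, boxes, depth=3):
    """Build an XML tree for one image.

    boxes: list of (class_name, xmin, ymin, xmax, ymax) in pixels.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", depth)):
        ET.SubElement(size, tag).text = str(value)
    # One <object> element per bounding box.
    for name, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin),
                           ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bndbox, tag).text = str(int(round(value)))
    return ET.ElementTree(root)
```

The returned tree can be saved with `tree.write("image_name.xml")`, which corresponds to the write function mentioned above.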
4. Results
4.1 Performance metrics
To evaluate the model accuracy, three metrics are used: Mean Average Precision (mAP), Precision, and Recall. Precision is the ratio of correct predictions to all predicted objects belonging to a given class. Therefore, it indicates how accurate the model's predictions are. It is given by the formula:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Where:
TP – True Positives
FP – False Positives
The higher the precision value, the more accurate the model's predictions are. This metric is crucial to this work because of its relevance to the problem. Recall is the ratio of correct predictions to all objects of a given class that are actually present in the data. Therefore, it indicates how many of the existing objects the model finds. It is given by the formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Where:
TP – True Positives
FN – False Negatives
High recall means that most instances of a given target are correctly recognized, making it the most critical metric for the problem at hand. mAP@50 is the Mean Average Precision at an Intersection over Union (IoU) threshold of 0.5. It determines the detection accuracy. To calculate it, the precision is plotted against the recall at the appropriate confidence thresholds, and the area under the curve is calculated for each class, yielding the Average Precision (AP). Finally, the AP is averaged across all the classes. For a single class, mAP equals AP. The formulas are as follows:
$$AP = \int_0^1 p(r)\,dr$$
$$mAP = \frac{1}{N_c}\sum_{i=1}^{N_c} AP_i$$
Where:
p(r) – precision as a function of recall r
AP_i – Average Precision of class i
Nc – Number of classes
A high mAP@50 indicates that the model accurately detects most objects. This is one of the most important metrics when evaluating the progress of the training process. When it increases, the overall model quality increases as well.
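The three metrics can be illustrated with a short Python sketch. It is not the evaluation code used in the study: the function names are hypothetical, and the area under the precision-recall curve is computed with all-point interpolation (a monotone precision envelope), which is one common convention; the text does not state which interpolation the tooling applies.

```python
def precision(tp, fp):
    """Share of predictions that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of ground-truth objects that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (points sorted by increasing recall)."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing when moving towards higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Average the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, 8 true positives with 2 false positives give a precision of 0.8, and the same 8 true positives with 8 missed objects give a recall of 0.5.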
4.2 Results comparison
Applying the YOLO algorithm is straightforward with the Ultralytics Python library. After importing the YOLO object, the model must be initialized with the YAML file described in the previous section. Following that, the training process can be initiated. The parameters used in the code are explained below:
Data: This parameter points to the yaml file, which is essential for the training process.
Epochs: Specifies how many times each image will be analyzed during training. As neural network training is iterative, this value should not be too small. A common practice is to use 100 epochs.
Batch: Determines the batch size, i.e. how many images are passed to the algorithm simultaneously. A batch size that is too large slows down the learning process, while a very small one increases the duration of the training loop.
Imgsz: Defines the resolution of the input images.
Pretrained: Specifies whether the model is pretrained. As this is the first training session, it is set to False.
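The parameters listed above map onto a training call in the Ultralytics library roughly as sketched below. This is an assumption-laden illustration rather than the original script: the model configuration name `yolov8n.yaml` (the model size is not stated in the text), the batch size of 16, and the helper names are all hypothetical. The library is imported inside the function so that the argument helper can be inspected without it.

```python
def build_train_args(data_yaml, epochs=100, batch=16, imgsz=1280):
    """Collect the training arguments described in the text."""
    return {
        "data": data_yaml,     # path to the dataset YAML file
        "epochs": epochs,      # number of passes over the training images
        "batch": batch,        # assumed value; not given in the text
        "imgsz": imgsz,        # input resolution used for the rescaled images
        "pretrained": False,   # first training session, no pretrained weights
        "device": "cpu",       # the development hardware had no usable GPU
    }


def train(data_yaml, model_cfg="yolov8n.yaml"):
    """Initialize a model from a configuration file and start training."""
    from ultralytics import YOLO  # requires the ultralytics package

    model = YOLO(model_cfg)
    return model.train(**build_train_args(data_yaml))
```

Keeping the arguments in one dictionary makes it easy to log exactly which settings produced a given model when the training is repeated later with more data.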
Training can be run on a GPU rather than the CPU to significantly speed up the training loop. However, the hardware used for model development was incompatible with this option. The estimated calculation time was approximately 3 days on a non-cutting-edge laptop. After this period, the best-performing model was saved. Additionally, YOLO provided detailed performance metrics, a significant advantage over other models. One interesting output is the confusion matrix shown in Figure 12. The normalized confusion matrices for the validation and test loops are similar to Figure 12 (Jingxin et al., 2025). As the confusion matrix is not a crucial parameter, especially for model comparison, it is discussed only for YOLO. First, the code outputs these matrices automatically, so they do not need to be generated separately. Secondly, presenting some issues and conclusions for the overall CV task is sufficient.
As shown, the greatest accuracy is observed for the catenary poles. This is not surprising, as this is the most numerous class. The weights class performs the worst and is mistaken chiefly for the background. This is because weights never occur on their own but rather on the poles, and from the camera's viewpoint they are often located behind them. This overlap among class objects is problematic in the CV setting and requires extensive class-level data and photo preparation for training. It can also be seen that the general behavior of all classes is comparable across the three data splits. The model itself provides accurate results. The aggregated table contains only the metrics mentioned in the first sub-chapter, as they are crucial for comparison with the SSD. The computation time of the trained model on a never-seen image is approximately 0.85 s.
The application of SSD is more complex than that of YOLO. First, a custom PascalVOCDataset class has to be defined. It creates the data structure and attributes that the data loaders access and pass to the model. The next step is to define the transform, which the data loaders use to handle the images appropriately. It ensures that the image has the optimal resolution, is converted to a tensor, and has normalized pixel values. The normalization parameters are the mean and standard deviation, respectively, and are chosen according to the standard approach to this problem. This operation improves the model's learning process. Following that, the model is initialized. Contrary to YOLO, a pretrained one is used; this is standard practice for SSD algorithms to speed up training. Once again, the calculations are performed on the CPU. Some additional method-specific parameters are defined, including the optimizer, which ensures the model's stability during learning. The learning rate scheduler is the final key parameter to be set before training. Since this model does not return mAP@50 by default during training, a custom function has been defined to print it, as it provides insight into the learning process. Finally, the training loop is defined. The number of epochs is set to 100 to make a fairer comparison with the YOLO model. The calculations took approximately 1.5 days. The results are output in a raw format, so custom functions are needed to compute the metrics. The key metrics have been calculated and are presented in the tables (see Tables 1 and 2). The average time to analyze a never-seen picture with the trained model is approximately 0.44 s.
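Computing the metrics from raw SSD outputs requires deciding which predicted boxes count as correct. A minimal sketch of that step is given below, assuming boxes in (xmin, ymin, xmax, ymax) form and a simple greedy matching by descending confidence at an IoU threshold of 0.5. The function names and the matching strategy are illustrative assumptions and are not taken from the original custom functions.

```python
def iou(a, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def count_matches(predictions, truths, threshold=0.5):
    """Return (TP, FP, FN) for one class in one image.

    predictions: list of (score, box); truths: list of boxes.
    """
    unmatched = list(truths)
    tp = fp = 0
    # Visit the most confident predictions first.
    for _score, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)  # each ground-truth box is matched once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)
```

The resulting counts feed directly into the precision and recall definitions from the performance metrics subsection.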
Results table for YOLO
| Class | Train Precision | Train Recall | Train mAP@50 | Validation Precision | Validation Recall | Validation mAP@50 | Test Precision | Test Recall | Test mAP@50 |
|---|---|---|---|---|---|---|---|---|---|
| All | 0.792 | 0.723 | 0.793 | 0.749 | 0.764 | 0.806 | 0.676 | 0.689 | 0.716 |
| catenary_pole | 0.961 | 0.989 | 0.992 | 0.933 | 0.989 | 0.993 | 0.902 | 0.959 | 0.974 |
| semaphore | 0.709 | 0.750 | 0.850 | 0.639 | 0.796 | 0.819 | 0.653 | 0.677 | 0.716 |
| warning_shield | 0.960 | 0.818 | 0.856 | 0.948 | 0.818 | 0.900 | 0.886 | 0.826 | 0.890 |
| road_lights | 0.754 | 0.900 | 0.860 | 0.686 | 0.933 | 0.871 | 0.554 | 0.873 | 0.844 |
| road_barrier | 0.667 | 0.500 | 0.630 | 0.624 | 0.571 | 0.638 | 0.452 | 0.500 | 0.415 |
| weights | 0.703 | 0.379 | 0.573 | 0.664 | 0.478 | 0.617 | 0.607 | 0.298 | 0.455 |
Results table for SSD
| Class | Train Precision | Train Recall | Train mAP@50 | Validation Precision | Validation Recall | Validation mAP@50 | Test Precision | Test Recall | Test mAP@50 |
|---|---|---|---|---|---|---|---|---|---|
| All | 0.283 | 0.983 | 0.281 | 0.277 | 0.759 | 0.238 | 0.263 | 0.707 | 0.217 |
| catenary_pole | 0.694 | 0.998 | 0.693 | 0.695 | 0.973 | 0.677 | 0.705 | 0.972 | 0.686 |
| semaphore | 0.222 | 1.000 | 0.222 | 0.333 | 0.875 | 0.292 | 0.191 | 0.818 | 0.156 |
| warning_shield | 0.231 | 1.000 | 0.231 | 0.214 | 0.923 | 0.197 | 0.273 | 0.900 | 0.245 |
| road_lights | 0.226 | 0.955 | 0.216 | 0.194 | 0.706 | 0.137 | 0.179 | 0.625 | 0.112 |
| road_barrier | 0.201 | 1.000 | 0.201 | 0.151 | 0.555 | 0.084 | 0.123 | 0.389 | 0.048 |
| weights | 0.123 | 0.944 | 0.121 | 0.073 | 0.520 | 0.038 | 0.107 | 0.538 | 0.056 |
5. Discussion
When comparing the two approaches, two key factors must be considered: accuracy and the time required for training and per-image computation. The correctness of the predictions is a key aspect of generating a statistical rundown for the company. While the box placement is not a critical parameter, the total number of correctly recognized instances is the most important one. The training time matters if the model is retrained in the future to increase accuracy, either by tuning hyperparameters or by feeding it an enhanced dataset. Over numerous report generations, an extensive set of photos would be accumulated, so the possibility of upgrading the model should be considered. The processing time of a single image is not crucial, as the analysis does not run in real time. For clarity, a comparison table (see Table 3) has been created to display the results of both algorithms. The values were obtained by subtracting the SSD outcomes from the YOLO ones and are reported as magnitudes, so a difference in favor of SSD also appears as a positive number.
Comparison of the accuracy
| Class | Train Precision | Train Recall | Train mAP@50 | Validation Precision | Validation Recall | Validation mAP@50 | Test Precision | Test Recall | Test mAP@50 |
|---|---|---|---|---|---|---|---|---|---|
| All | 0.509 | 0.260 | 0.512 | 0.472 | 0.005 | 0.569 | 0.413 | 0.018 | 0.499 |
| catenary_pole | 0.267 | 0.009 | 0.299 | 0.238 | 0.016 | 0.316 | 0.197 | 0.013 | 0.288 |
| semaphore | 0.487 | 0.250 | 0.628 | 0.306 | 0.079 | 0.527 | 0.462 | 0.141 | 0.56 |
| warning_shield | 0.729 | 0.182 | 0.625 | 0.734 | 0.105 | 0.703 | 0.613 | 0.074 | 0.645 |
| road_lights | 0.528 | 0.055 | 0.644 | 0.492 | 0.277 | 0.734 | 0.375 | 0.248 | 0.732 |
| road_barrier | 0.466 | 0.500 | 0.429 | 0.473 | 0.016 | 0.554 | 0.329 | 0.111 | 0.367 |
| weights | 0.580 | 0.565 | 0.452 | 0.591 | 0.042 | 0.579 | 0.500 | 0.240 | 0.399 |
The first thing that stands out is SSD's superior recall on the training set. However, SSD falls short in terms of the actual precision of its predictions. On the validation and test sets, which are crucial for assessing a machine learning model, this discrepancy in recall is greatly diminished, and the models are comparable. Regarding precision, YOLO proves superior across all the sets. One interesting observation is that while YOLO struggled with the weights class, SSD shows visibly better recall for it across all three splits. SSD both trains and processes a single image roughly twice as fast as YOLO. However, the YOLO times are still reasonable for the problem, especially given that they were measured on low-end hardware. YOLO is also significantly more user-friendly. Considering everything, the YOLO model proved superior and will be used in the final code. It has comparable recall on the key data splits, better prediction precision, and is easier to use. While it would take longer to improve, the time is still reasonable.
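The entries of the comparison table can be reproduced with simple arithmetic. The sketch below uses the "All" rows of the two results tables; the second table is assumed here to hold the SSD results, since its values are the ones consistent with the reported differences. The variable and function names are illustrative only.

```python
# "All" rows copied from the results tables: (precision, recall, mAP@50) per split.
YOLO_ALL = {
    "train": (0.792, 0.723, 0.793),
    "validation": (0.749, 0.764, 0.806),
    "test": (0.676, 0.689, 0.716),
}
SSD_ALL = {
    "train": (0.283, 0.983, 0.281),
    "validation": (0.277, 0.759, 0.238),
    "test": (0.263, 0.707, 0.217),
}


def differences(yolo, ssd):
    """Signed YOLO minus SSD difference per split and metric, rounded to 3 decimals."""
    return {
        split: tuple(round(y - s, 3) for y, s in zip(yolo[split], ssd[split]))
        for split in yolo
    }


DIFF = differences(YOLO_ALL, SSD_ALL)
```

A negative recall difference means SSD scored higher on that split; the comparison table lists the same numbers without the sign.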
5.1 Final code output
The final step in introducing automation for the process is to provide code to generate a statistical report from the quadcopter reconnaissance photos. As the images include geographical data, an additional output of the interactive map showing the locations of the predictions will be provided. The code takes four inputs:
Photos folder: this is a path to the folder with all the images
Cords: path to the .txt file with geographical data. This does not need to be in the same folder as the images; however, the ID column should correspond to existing image names. For example, image 5down0054.JPG corresponds to ID 54.
Model path: path to the YOLO model
Output folder: path to the folder in which the results will be stored
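Linking an image to its row in the coordinates file requires extracting the ID from the file name. A minimal sketch is shown below; it assumes that the ID is the trailing run of digits in the file name (as in the 5down0054.JPG example), which is an inference from that single example rather than a documented rule, and the function name `image_id` is hypothetical.

```python
import re
from pathlib import Path


def image_id(filename):
    """Return the numeric ID encoded at the end of an image file name."""
    stem = Path(filename).stem  # file name without the extension
    match = re.search(r"(\d+)$", stem)
    if match is None:
        raise ValueError(f"No trailing digits in file name: {filename}")
    return int(match.group(1))  # int() drops the leading zeros
```

Raising an error for names without trailing digits makes a mismatch between the photo folder and the coordinates file visible immediately instead of silently skipping images.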
The output consists of two files. The first one is an Excel file with the predicted number of occurrences of each class (see Table 3): catenary pole, warning shield, semaphore, road lights, road barrier, and weights. The second output is an HTML map (see Figures 13 and 14) showing the locations of the predictions, which can be opened in any browser. On the map, the user can see all objects automatically identified by the algorithm. Such a solution allows for significantly faster identification of the essential elements and dispatching the proper technicians to the indicated location. Each photo includes geolocation, so the precise location is provided. A legend with filters is available to improve clarity. In Figure 13, the user can see the map with filters applied (selected elements, i.e. masts). Each class has its corresponding marker color. An important note is that a small positional offset has been applied to the markers so that different classes detected at the same location remain distinguishable. The naming convention for both outputs is based on the BIOS clock, which is unique for each run, to avoid accidentally overwriting previously conducted analyses.
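Two details of the output step, the clock-based file naming and the small shift that separates markers of different classes, can be sketched as follows. Everything here is a hypothetical illustration: the file name pattern, the one-second timestamp resolution, the offset size, and the function names are assumptions, not values taken from the original code.

```python
from datetime import datetime


def run_stamp(now=None):
    """Timestamp string used to make output file names unique per run."""
    now = now or datetime.now()
    return now.strftime("%Y%m%d_%H%M%S")


def output_names(stamp):
    """Names of the Excel report and the HTML map for one analysis run."""
    return f"report_{stamp}.xlsx", f"map_{stamp}.html"


def offset_marker(lat, lon, class_index, step=1e-5):
    """Shift a marker slightly so different classes at one photo location do not overlap."""
    return lat + class_index * step, lon + class_index * step
```

With a one-second resolution, names are unique only if two analyses are not started within the same second; a finer resolution could be used if that matters in practice.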
6. Conclusions and future work
The study demonstrates the potential of a deep learning approach for railway track monitoring, with YOLO outperforming SSD on the provided dataset. This outcome aligns with what both models should, in theory, excel at. The results look promising even without optimizing the model to the best achievable accuracy and despite a significant imbalance in class representation in the training set. The sample code created with the final best model produces results that a monitoring facility could apply in real life. The data gathered by the UAV during multiple flights are now subject to the automated identification system. We acknowledge that the limited amount of data and the need to mitigate class imbalance are key directions for future work rather than resolved issues. For further work, the model's hyperparameters should be tuned to increase the accuracy. Additionally, it is crucial to feed it more instances of poorly represented classes, such as road barriers and lights. Once the final code is in use with real data, the algorithm can be retrained regularly. With such an approach, the accuracy across all classes could plausibly exceed 0.9, which is a very good result in machine learning. The code and model are ready for real-world testing. With more data and feedback, the YOLO model could be improved to the point where it could be deployed by a company responsible for the infrastructure. With minimal resulting error, this approach would save time and increase the safety of railway travel.