Figure 2

Four charts show played rounds, time consumption, action frequencies, and the action parameters used across training episodes.

The four panels labeled “(a)”, “(b)”, “(c)”, and “(d)” are arranged in a two-by-two layout. The charts summarize reinforcement learning episode statistics, including played rounds, computation time, action frequencies, and parameter usage.

Panel (a) is titled “Played Rounds per Episode”. The horizontal axis is labeled “Episode” and ranges from 0 to 1000 in increments of 100. The vertical axis is labeled “Total Round” and ranges from 0 to 30 in increments of 2. Blue vertical bars represent the number of rounds played in each episode. Episodes before about 170 mostly remain below 12 rounds, with a peak near 27 for episode 90. After about episode 180, many episodes rapidly increase and frequently reach between 20 and 30 rounds. From about episode 300 onward, most bars remain near the maximum value of 30 rounds with only occasional drops below 20.

Panel (b) is titled “Time Consumption by Episode”. The horizontal axis is labeled “Episode” and ranges from 0 to 1000 in increments of 100. The vertical axis is labeled “Total Time Taken (seconds)” and ranges from 0 to 10. Early episodes before about 170 remain close to 0 seconds. Between episodes 180 and 300, the values fluctuate widely between about 1 and 10 seconds. After about episode 300, most episodes stabilize between 8 and 10 seconds with occasional decreases below 5 seconds. Several peaks slightly exceed 10 seconds near episodes 400, 620, and 820.

Panel (c) is titled “Frequency of the Chosen Actions by Episode”. The horizontal axis is labeled “Episode” and ranges from 0 to 1000 in increments of 100. The vertical axis is labeled “Frequency (times)” and ranges from 0 to 30 in increments of 2. Multiple colored line graphs represent frequencies of actions labeled “Action 0” through “Action 8”. A vertical black line near episode 180 separates the regions labeled “Random Actions” on the left and “Chosen by the P-DQN Network” on the right. “Action 0”, shown in blue, becomes the dominant action after episode 250 and frequently reaches values between 20 and 30 occurrences, with many peaks at the maximum value of 30. “Action 3”, shown in red, shows strong activity between episodes 200 and 260 with peaks between 18 and 24 before decreasing sharply afterward. “Action 7”, shown in gray, becomes prominent between episodes 240 and 300 with frequencies reaching about 22. “Action 8”, shown in yellow-green, briefly rises near episode 190 with frequencies around 15. The remaining actions mostly remain below 5 occurrences across most episodes.

Panel (d) is titled “Used Action Parameters”. The vertical axis is labeled “Frequency (times)” and ranges from 0 to 8000 in increments of 2000. Overlapping histograms display parameter usage frequencies for “Action 0”, “Action 3”, and “Action 7”. Blue bars represent “Action 0”, red bars represent “Action 3”, and gray bars represent “Action 7”. The horizontal axis contains two parameter scales. The lower scale labeled “Parameters for Action 0” ranges from 0 to 0.5 in increments of 0.1. The upper scale labeled “Parameters for Action 3 and 7” ranges from 0 to 10 in increments of 1. For “Action 0”, the highest frequency occurs near parameter value 0 with 8000 occurrences. Frequencies near parameter value 0.1 are about 3000, and near parameter value 0.5 are about 3500. For “Action 3”, the highest frequencies occur near parameter values between 0 and 1, with bars reaching about 600 occurrences near 0 and about 300 near 1. Frequencies decrease sharply beyond parameter value 2. For “Action 7”, the largest frequencies also occur near parameter values between 0 and 1, with bars near 0 reaching about 500 occurrences. Very few occurrences appear beyond parameter value 2.

Note: All numerical data values are approximated.
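As a reading aid for panel (c), the per-episode action frequencies it plots could be tallied roughly as in the sketch below. This is only an illustration, not the study's code: the function name `action_frequencies`, the `toy_log` structure, and its values are hypothetical, and the only details taken from the figure are the nine action labels (“Action 0” through “Action 8”) and the 30-round ceiling visible in panel (a).

```python
from collections import Counter

NUM_ACTIONS = 9  # "Action 0" through "Action 8", as labeled in panel (c)


def action_frequencies(episode_actions):
    """Return a list whose entry i is how often action i was chosen in one episode."""
    counts = Counter(episode_actions)
    return [counts.get(action, 0) for action in range(NUM_ACTIONS)]


# Hypothetical toy log (NOT data from the figure): episode index -> actions per round.
toy_log = {
    0: [3, 8, 1],   # a short episode with mixed actions
    1: [0] * 30,    # a 30-round episode that always picks Action 0
}

per_episode = {ep: action_frequencies(acts) for ep, acts in toy_log.items()}
for ep, freqs in per_episode.items():
    print(ep, freqs)
```

Each line drawn in panel (c) would then correspond to one column of these per-episode counts, read across episodes.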

Overview of the workflow proposed in this study
