Figure A1

Four charts show played rounds, time consumption, action frequencies, and used action parameters across training episodes.

The four panels labeled “(a)”, “(b)”, “(c)”, and “(d)” are arranged in a two-by-two layout. The charts summarize reinforcement learning episode statistics, including played rounds, computation time, action frequencies, and parameter usage.

Panel (a) is titled “Played Rounds per Episode”. The horizontal axis is labeled “Episode” and ranges from 0 to 60 in increments of 5. The vertical axis is labeled “Total Round” and ranges from 0 to 20 in increments of 2. Blue vertical bars represent the number of rounds played in each episode. Early episodes show low values between 1 and 8 rounds. From episode 6 onward, many episodes reach between 15 and 20 rounds. Several peaks reach the maximum value of 20 rounds, including around episodes 7, 9, 25, 31, 35, 36, 39, 40, 43, 45, 46, 47, 49, and 52. Lower values appear intermittently near episodes 23, 28, 30, 44, 53, 55, 56, and 58.

Panel (b) is titled “Time Consumption by Episode”. The horizontal axis is labeled “Episode” and ranges from 0 to 60 in increments of 5. The vertical axis is labeled “Total Time Taken (seconds)” and ranges from 0 to 16000 in increments of 2000. Blue vertical bars represent computation time for each episode. Early episodes mostly remain below 1000 seconds. Larger spikes begin after episode 30. Major peaks occur near episode 31 at about 6400 seconds, episode 35 at about 5600 seconds, episode 40 at about 5200 seconds, episode 44 at 15200 seconds (the highest value), episode 46 at about 7800 seconds, episode 47 at about 5200 seconds, episode 51 at about 10100 seconds, episode 52 at about 8800 seconds, and episode 55 at about 4100 seconds. Most remaining episodes remain below 3000 seconds.

Panel (c) is titled “Frequency of the Chosen Actions by Episode”. The horizontal axis is labeled “Episode” and ranges from 0 to 60 in increments of 5. The vertical axis is labeled “Frequency (times)” and ranges from 0 to 16 in increments of 2. Multiple colored line graphs represent frequencies of actions labeled “Action 0” through “Action 8”. A vertical black line near episode 8 separates the regions labeled “Random Actions” on the left and “Chosen by the P-DQN Network” on the right. “Action 0”, shown in blue, becomes the dominant action after episode 10 and frequently ranges between 4 and 13 occurrences, with peaks near episodes 38 and 46. “Action 1”, shown in orange, shows several high peaks between episodes 10 and 18, including a maximum near 16 around episode 16, but decreases afterward. “Action 3”, shown in red, fluctuates mostly between 1 and 5 with a large spike near episode 51 reaching 11. The remaining actions, including Actions 2, 4, 5, 6, 7, and 8, generally remain below 5 occurrences across most episodes.

Panel (d) is titled “Used Action Parameters”. The vertical axis is labeled “Frequency (times)” and ranges from 0 to 50 in increments of 10. Overlapping histograms display parameter usage frequencies for “Action 0”, “Action 1”, and “Action 3”. Blue bars represent “Action 0”, yellow bars represent “Action 1”, and red bars represent “Action 3”. The horizontal axis contains two parameter scales. The lower scale labeled “Parameters for Action 0” ranges from 0 to 0.05 in increments of 0.01. The upper scale labeled “Parameters for Action 1 and 3” ranges from 0 to 10 in increments of 1. For “Action 0”, the highest frequency occurs near parameter value 0 with 50 occurrences. Frequencies decrease gradually as parameter values approach 0.05. For “Action 1”, the highest frequency occurs near parameter value 10 with 43 occurrences. For “Action 3”, frequencies are distributed more broadly across the range from 0 to 10, with larger concentrations near parameter values 0 and 10.

Note: All numerical data values are approximated.

Statistics from the log file regarding the game round, consumed time, and actions chosen in the experiments