The conventional design and management of urban trees often overlook the benefits of specific canopy shapes, despite their crucial role in enhancing thermal comfort and optimizing direct sunlight utilization. This study presents a novel workflow in which designers define target leaf areas, and a decision-support algorithm guides tree management specialists in regulating growth through branch pruning to meet these targets.
We developed a framework that integrates a tree growth simulation game with a deep reinforcement learning (DRL) network for decision-making. The simulation predicts growth responses to pruning and assesses how closely the resulting structure matches the target leaf area. Based on the current tree state and reward feedback, the DRL network issues pruning decisions. The DRL network learns to optimize pruning strategies by iteratively interacting with the simulation game.
The configured network proved effective in navigating the complex and extensive hybrid decision space associated with tree pruning. It successfully acquired techniques to minimize penalties and consistently achieve relatively high reward scores in the game.
High computational resource consumption remains a significant challenge. Additionally, the reward function lacks clear definitions that consistently guide the model toward the intended design targets.
This work establishes a novel technical pathway for implementing the proposed workflow, employing a voxel approach in the design and management of urban trees. It facilitates multifunctional tree use aligned with explicitly defined design objectives.
Nomenclature
1. Introduction
In light of global warming and the urban heat island effect (Rahman et al., 2020; Wong et al., 2021), there is an increasing need for diverse and multifunctional use of urban trees. Studies have shown that the effect of trees in providing thermal comfort is directly related to their particular crown structure and geometry (Krayenhoff et al., 2014; Oshio et al., 2021). Recent practices in landscape architecture have also experimented with tree planting designs to optimize the period and intensity of indoor sunlight through seasons (Pan and Jakubiec, 2022). A larger crown can better shade the pedestrians and the building façade to reduce extreme sun radiation exposure in summer (Palme et al., 2019). Trees’ crown shapes, in this case, must be site-specific and carefully designed to offer such add-on ecosystem services.
In a traditional urban tree design and management workflow, trees are drawn by landscape architects, who commonly illustrate canopies as perfect geometrical forms of a fixed size (see Figure 1a). A young tree newly transplanted from a nursery to the site has not yet reached this size and, with its limited biomass, cannot provide many add-on functions (see Figure 1b). Over the long period of growth that follows, arborists (tree management specialists), rather than designers, work hands-on to ensure the trees’ safety and health (see Figure 1c). As a result, the final form of urban trees is mostly determined by rules such as keeping a certain clearance height above vehicle lanes and a certain distance from buildings (see Figure 1d).
Figure 1. Scenarios of a possible street tree case
However, for trees to provide more ecosystem services, their crowns must be strategically managed, especially pruned, to approach a design goal. The performance of trees can be quantified from multiple perspectives, such as improving air quality (Vos et al., 2013) and maximizing cooling effects (Grylls and van Reeuwijk, 2021). The simulation tools used for such quantification can also help optimize targets for leaf areas. Designers or decision-makers can work with voxels to plan the targeted crown shapes and density in 3-D space (Ludwig et al., 2024; Yazdi et al., 2023) (see Figure 1e). In this novel workflow, driving tree growth towards such design targets is key. An arborist could be advised by a decision-support algorithm at any stage of a tree’s growth (see Figure 1g). Following this advice, the outcome of urban trees can better approach what was designed (see Figure 1h). This vision is impossible with the conventional workflow, where the design and the management of urban trees are carried out independently.
Bringing this vision to reality requires robust technical foundations, including (1) advanced tree sensing and modeling (Nitoslawski et al., 2019), (2) target-oriented decision-support algorithms for branch pruning at the single-tree level (Yazdi et al., 2023), and (3) the integration of tree-level decision support to align with urban-scale green infrastructure management strategies, also enabling top-down planning.
Regarding tree sensing and modeling, LiDAR has proven efficient in sensing trees and their environment (Abegg et al., 2017; Dan et al., 2012; Shu et al., 2022; Yazdi et al., 2024). Based on detailed scans of trees in the form of point clouds, quantitative structure models (QSMs) of trees enable detailed documentation of branch segments and their topological connections. Among the numerous open-source methods, treeQSM (Raumonen et al., 2013) has demonstrated superior noise resistance and fidelity to reality compared with AdTree (Du et al., 2019) and AdQSM (Fan et al., 2020). The collected QSM data enable the prediction of resprouting patterns of new shoots following branch pruning (Shu et al., 2024b). Additionally, leaf area density (LAD) within voxel grids can be estimated from these data (Shu et al., 2024a). Although these estimation models are currently limited in species coverage and dataset size, remote sensing is expected to make tree data available in rapidly growing volumes in the foreseeable future. Deeper and broader knowledge of tree growth should therefore become available to designers and decision-makers through simulation software and plug-ins.
Regarding decision-support algorithms for branch pruning at the single-tree level, the primary challenge lies in the vast decision space, particularly across time sequences. Because the number of possible pruning scenarios grows exponentially, it is infeasible to predefine an optimal solution for every situation. Common methods for handling such complex decision spaces include Monte Carlo Tree Search (MCTS) (Fu, 2016), which reduces computational complexity by employing randomized sampling, and Markov Decision Processes (MDPs) (Puterman, 2014), which provide a mathematical framework for situations where outcomes are partly stochastic and partly controllable. However, both MCTS and MDPs require well-defined state boundaries, for example, a robot’s position within a fixed space or the presence and color of pieces on a Go board. In contrast, tree growth lacks fixed structural rules, making it impractical to define such clear state boundaries. The number, connectivity, and arrangement of branches vary dynamically, posing significant challenges for conventional decision-making models. A more suitable approach is to train a neural network on a relatively large yet finite set of scenarios with explicitly defined rewards for its decisions. Through this process, the neural network is expected to learn by itself and to make robust decisions in untrained situations so as to achieve a high reward or minimize the penalty. Typically, such training processes use a “game” (a simulated environment) to generate different situations and rewards that guide the decision-making capability of the neural network (Diallo et al., 2017; Lee et al., 2018; Vinyals et al., 2017). Classical reinforcement learning (RL) models predominantly focus on lower-dimensional problems, such as those with tabular data, and often face scalability limitations when addressing complex problems such as the decisions in urban tree management. These limitations arise from their reliance on accurate value functions for every tree state (Wang et al., 2024). In contrast, deep neural networks perform better by approximating value functions and policies instead of exhaustively computing them for every state. By integrating deep learning with RL, deep reinforcement learning (DRL) can overcome the shortcomings of conventional RL (Mnih et al., 2013). DRL is therefore a more suitable approach for decision-making problems in complex environments with high-dimensional inputs. Owing to these features, DRL has been widely applied across a variety of domains, including gaming (Silver et al., 2016; Vinyals et al., 2017), manufacturing (Li et al., 2023), and building energy management (Yu et al., 2021). These applications highlight the versatility and effectiveness of DRL in complex decision-making.
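The exponential growth of pruning scenarios mentioned in this paragraph can be made concrete with a short calculation. The sketch below is illustrative only: it assumes that the same number of options is available in every round, and the figure of 9 options per round is a hypothetical value chosen for the example, not a parameter taken from the text above.

```python
def count_pruning_sequences(options_per_round: int, rounds: int) -> int:
    """Number of distinct decision sequences when the same number of
    options is available in every round (a simplifying assumption)."""
    if options_per_round < 1 or rounds < 0:
        raise ValueError("need at least one option and a non-negative number of rounds")
    return options_per_round ** rounds


# Hypothetical figures for illustration only: 9 options per round.
for rounds in (1, 5, 10, 20):
    print(rounds, count_pruning_sequences(9, rounds))
```

Because the count is a power of the number of rounds, storing one value per sequence quickly becomes impractical, which is the motivation given above for approximating value functions rather than tabulating them.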
Regarding an urban-scale decision system for managing green infrastructure, digital twins of individual trees, including their detailed representations (Shu et al., 2022) and ecological interactions, serve as the foundation. Unlike traditional static approaches, a smart green infrastructure management system enables adaptive and responsive decisions (Bittencourt et al., 2024). For instance, urban trees are deeply interconnected with water management, influencing irrigation needs (Silva et al., 2020), reducing runoff, and mitigating flood risks (Dowtin et al., 2023). As precipitation patterns fluctuate, irrigation and rainwater harvesting decisions can be improved by leveraging real-time data and predictive analytics to optimize resource allocation and respond to changing conditions (Rambhia et al., 2023). Large-scale tree monitoring originated in forestry research (Holmgren and Thuresson, 1998), primarily for yield estimation and resource management (Helmes and Stockbridge, 2011). These methodologies are now being adapted for urban forestry, where monitoring focuses on tree count, height, volume, and ecological contributions (Brandt et al., 2025). Such information is used to inform tree planting that minimizes the urban heat island effect (Francis et al., 2023). Despite this application, urban-scale tree monitoring has not yet transitioned into decision-making systems for tree management. Instead, pioneering research has explored using tilt-angle sensors to detect early signs of tree instability before extreme weather events, such as typhoons (Chau et al., 2023). In preparation for such events, targeted pruning strategies must also be implemented at the individual tree level to reduce the risk of failure and enhance public safety. This necessity has become another driver for automated decision-support algorithms for branch pruning, paving the way toward a responsive and proactive urban forestry management system.
In this context, it is crucial to establish a structured workflow for developing digital tools that can bridge the gap between tree design and management. Pruning is among the most effective methods for shaping tree crowns (Bedker et al., 2012). So far, pruning and growth simulations, such as EduAPPLE (Kohek et al., 2015), have been primarily physiology-based and built on the L-system (Prusinkiewicz and Lindenmayer, 1996). However, these approaches do not align with the target-oriented design envisioned in this study (see Figure 1). Therefore, this research introduces a novel objective: optimizing branch pruning decisions, marked on cylinders within the QSM of a tree, so that the crown approaches a specific geometry. Following this aim, this study addresses two key questions: (1) How can such an algorithm be developed? (2) What can be achieved, and what obstacles remain with current technologies? Building on the successes of DRL reported in the literature, this study investigated whether DRL could generate optimal branch manipulation decisions for individual trees to approach a design target represented by voxels.
2. Method
Following Tree Information Modeling (Shu et al., 2022), the geometric primitives of a tree can be separated into branches, leaves, and roots. Among these three organs, growth simulations, especially those following pruning operations, must rely on geometrical and topological representations of branches. Voxels can describe the targeted spatial areas for leaves (Yazdi et al., 2023). Roots are outside the scope of this study because validated data and models are currently lacking. Consequently, the QSM and the LAD per voxel were defined as the primitives of the tree state in the workflow tested below (see “Data Format” in Figure 2).
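One possible way to hold these two primitives in memory is sketched below. It is a minimal illustration under stated assumptions, not the data format used in the study: the class names `Cylinder` and `TreeState`, their fields, and the helper `subtree` are all hypothetical, and the LAD grid is stored sparsely as a dictionary keyed by integer voxel indices.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Voxel = Tuple[int, int, int]           # integer voxel index (i, j, k)


@dataclass
class Cylinder:
    """One branch segment of a QSM; field names are illustrative."""
    start: Tuple[float, float, float]  # base point of the segment
    axis: Tuple[float, float, float]   # direction of the segment
    length: float
    radius: float
    parent: Optional[int]              # index of the parent cylinder, None at the stem base


@dataclass
class TreeState:
    cylinders: List[Cylinder]          # branch geometry and topology (QSM)
    lad: Dict[Voxel, float]            # leaf area density per occupied voxel


def subtree(cylinders: List[Cylinder], root: int) -> List[int]:
    """Indices of `root` and of every cylinder that descends from it,
    i.e. the segments that a cut at `root` would remove."""
    children: Dict[Optional[int], List[int]] = {}
    for index, cyl in enumerate(cylinders):
        children.setdefault(cyl.parent, []).append(index)
    stack, found = [root], []
    while stack:
        index = stack.pop()
        found.append(index)
        stack.extend(children.get(index, []))
    return sorted(found)
```

Storing the parent index on each cylinder keeps the topological connections explicit, so a pruning decision marked on one cylinder can be expanded to the whole branch it carries.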
Figure 2. Overview of the workflow proposed in this study
A theoretical framework was developed that takes the current tree state of branches in QSM and the targeted state of leaves in voxels as its two initial inputs (illustrated on the left of Figure 2). It was used to train a decision mechanism for pruning through DRL. The framework comprises two main components: a tree growth game (see section 2.1) and a DRL model (see section 2.2). The tree growth game is an iterative growth simulator of trees that accepts pruning decisions based on the QSM and predicts the LAD of the future state in each turn. The DRL model is a machine player of the game. It chooses from preset pruning strategies and evaluates their effects by comparing the LAD results received from the game with the target leaf voxels.
In this workflow, each function is highly modular. Therefore, all the models in the current implementation (see “Applied Model” in Figure 2) can be replaced or revised with other models with equivalent effects.
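The interaction between the game and its player can be sketched as a simple loop. Everything in the block below is a stand-in under stated assumptions: `ToyTreeGame` uses an invented growth rule and only two invented actions, the reward is assumed to be the negative sum of absolute LAD differences, and `random_policy` merely occupies the place of the DRL model. None of these are the models applied in the study.

```python
import random
from typing import Callable, Dict, Tuple

Voxel = Tuple[int, int, int]
LAD = Dict[Voxel, float]               # sparse leaf area density grid


def mismatch(current: LAD, target: LAD) -> float:
    """Sum of absolute LAD differences over all voxels in either grid."""
    keys = set(current) | set(target)
    return sum(abs(current.get(k, 0.0) - target.get(k, 0.0)) for k in keys)


class ToyTreeGame:
    """Stand-in for the tree growth game (toy rules, not the study's models)."""

    def __init__(self, initial: LAD, target: LAD, max_rounds: int = 5):
        self.state = dict(initial)
        self.target = target
        self.max_rounds = max_rounds
        self.round = 0

    def step(self, action: int):
        # Action 0: no pruning. Action 1: remove the topmost voxel layer.
        if action == 1 and self.state:
            top = max(z for (_, _, z) in self.state)
            self.state = {v: d for v, d in self.state.items() if v[2] != top}
        # Toy growth: each remaining voxel adds half its LAD to the voxel above.
        grown = dict(self.state)
        for (x, y, z), density in self.state.items():
            above = (x, y, z + 1)
            grown[above] = grown.get(above, 0.0) + 0.5 * density
        self.state = grown
        self.round += 1
        reward = -mismatch(self.state, self.target)  # assumed reward definition
        done = self.round >= self.max_rounds
        return self.state, reward, done


def random_policy(state: LAD) -> int:
    """Placeholder player; a trained network would go here."""
    return random.choice((0, 1))


def play_episode(game: ToyTreeGame, policy: Callable[[LAD], int]) -> float:
    """Let `policy` play one episode and return the accumulated reward."""
    total, done, state = 0.0, False, game.state
    while not done:
        state, reward, done = game.step(policy(state))
        total += reward
    return total
```

Passing the player in as a callable mirrors the modularity described above: the growth rule, the reward, and the policy can each be swapped without touching the loop.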
2.1 The tree growth game
As shown in the red section in Figure 2, every episode of the tree growth game starts with the initial state of a tree. In the user interface of this game, the target leaf voxels are shown next to the current leaf voxel state to support pruning decisions. In our experiments, no specific design was made to set the targeted leaf voxels. Such a design is supposed to be site-specific, considering multifunctional performances, including ecosystem services and aesthetic value. How these target leaf voxels are decided is irrelevant to training the decision-making algorithm in this study. To bypass this step, a cluster of voxels was randomly populated in the upper part of the voxel space as a virtual target for the game (see Figure 3c). The targets differ from episode to episode. After an initial tree state and a target state are set up, the tree’s growth is simulated by two components: the tree’s particular reaction to exogenous stimuli and a generic growth under given environmental conditions. The game’s objective is to select optimal pruning strategies over a time sequence to progressively align the leaf voxel state shown in Figure 3b with the target state depicted in Figure 3c.
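A random target of the kind described above could be generated as in the following sketch. The procedure is an assumption made for illustration: the text only states that a cluster of voxels is randomly populated in the upper part of the voxel space, so the cubic grid, the "upper half" cut-off, the face-neighbour growth rule, and the function name `random_target` are all hypothetical.

```python
import random
from typing import Optional, Set, Tuple

Voxel = Tuple[int, int, int]

# Offsets to the six face-adjacent neighbours of a voxel.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def random_target(size: int, n_voxels: int, seed: Optional[int] = None) -> Set[Voxel]:
    """Grow a connected random cluster of `n_voxels` voxels inside the
    upper half of a size x size x size grid (assumed layout)."""
    z_min = size // 2                        # lowest layer counted as "upper part"
    capacity = size * size * (size - z_min)
    if size < 1 or not 1 <= n_voxels <= capacity:
        raise ValueError("n_voxels must fit into the upper half of the grid")
    rng = random.Random(seed)
    start = (rng.randrange(size), rng.randrange(size), rng.randrange(z_min, size))
    cluster = {start}
    while len(cluster) < n_voxels:
        x, y, z = rng.choice(sorted(cluster))  # sorted for reproducibility
        dx, dy, dz = rng.choice(NEIGHBOURS)
        cand = (x + dx, y + dy, z + dz)
        inside = 0 <= cand[0] < size and 0 <= cand[1] < size and z_min <= cand[2] < size
        if inside:
            cluster.add(cand)
    return cluster


# A different seed per episode yields a different target each time.
targets = [random_target(size=10, n_voxels=20, seed=episode) for episode in range(3)]
```

Growing the cluster from face neighbours keeps the target connected, which seems a reasonable property for a crown-like shape, although the study may have used a different sampling scheme.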
Figure 3. Plotted tree states displayed to players in an example episode: (a) Initial branch state in QSM of a young plane tree; (b) Estimated LAD of this young plane tree at its initial state; (c) A randomly populated target LAD in the game
2.1.1 The first tree growth game for training
In the first training version, one of seven young plane trees scanned in a tree nursery (see Figure 3a) was randomly picked at the start of each episode. The trees had almost the same diameter at breast height (DBH), crown diameter, and height, but differed in their branching. The LAD distribution of the selected tree was then estimated by allocating cylinders from the QSM to individual voxels (Shu et al., 2024a). The reaction to stimuli refers to the resprouting of new shoots following branch pruning. This process takes the pruning locations on branches in the QSM as input.
The decision of where to prune is the game’s most impactful operation for tree growth. Technically, any branch node can be marked as a pruning point. However, with such enormous decision-making freedom, clicking on individual branches to prune them every turn would be too much effort for a human player, while a machine player would consume too much computational capacity testing different input combinations. To tackle this problem, four typical pruning strategies in gardening practice were predefined in the game: thinning, raising, reduction, and topping (Clark and Matheny, 2010; Speak and Salbitano, 2023). Each pruning strategy allows users to enter further parameters to specify the operation. The thinning strategy prunes branches whose minimum distance to other branches is less than a given threshold (see Figure 4a). The raising strategy prunes branches below a given height above the crown start (see Figure 4b). The reduction strategy prunes branches from the outermost extent in a given direction (west, east, north, south, or top) inward up to a given distance (see Figure 4c). The topping strategy prunes branches within a given depth of every branch’s fine end (see Figure 4d). Besides these four pruning operations, the game allows the player to take no action in a round or to manually end the episode in the operation phase.
Figure 4. Four predefined pruning strategies for the players to choose from: (a) Thinning (minimum distance 0.05 m); (b) Raising (raised height 0.4 m); (c) Reduction (distance 0.8 m from west); (d) Topping (cylinder depth 3). Cut branches are shown in red and kept branches in gray.
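The parameterized strategies described above can be read as simple geometric filters over the QSM cylinders. The following minimal sketch covers raising and reduction only. The `Cylinder` record, the coordinate convention (x grows eastward, z upward), and the function names are illustrative assumptions, not the game's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cylinder:
    """Hypothetical QSM cylinder reduced to its centre point (metres)."""
    x: float
    y: float
    z: float


def raising(cylinders, crown_start, raised_height):
    """Keep only cylinders at or above crown_start + raised_height."""
    limit = crown_start + raised_height
    return [c for c in cylinders if c.z >= limit]


def reduction_from_west(cylinders, distance):
    """Remove cylinders within `distance` of the westernmost extent.

    Assumes that west corresponds to the smallest x value.
    """
    if not cylinders:
        return []
    west_edge = min(c.x for c in cylinders)
    return [c for c in cylinders if c.x >= west_edge + distance]


# Toy crown with four cylinders (made-up coordinates).
tree = [Cylinder(0.0, 0.0, 2.0), Cylinder(-1.0, 0.0, 2.3),
        Cylinder(0.5, 0.0, 3.0), Cylinder(-0.5, 0.2, 3.5)]
after_raising = raising(tree, crown_start=2.0, raised_height=0.4)
after_reduction = reduction_from_west(tree, distance=0.8)
```

Each strategy then needs only one scalar parameter (plus a direction for reduction), which is what keeps the decision space tractable for both human and machine players.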
After the pruning operation, a set of random cylinder starting positions was selected as buds that produce new shoots, with a minimum of 4 cylinders between any two of them. For each of these positions, the shooting angle, length, and radius were assigned within defined domains.
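One way to picture this bud selection is a random draw with a spacing constraint. The sketch below is a simplified illustration that treats a single branch as a sequence of cylinder indices. The function names and all numeric ranges for angle, length, and radius are placeholders, since the text does not state the actual domains.

```python
import random


def pick_buds(n_cylinders, n_buds, min_gap=4, rng=None):
    """Randomly pick bud indices along one branch so that at least
    `min_gap` cylinders lie between any two chosen positions."""
    rng = rng or random.Random()
    candidates = list(range(n_cylinders))
    rng.shuffle(candidates)
    buds = []
    for idx in candidates:
        # Accept a candidate only if it is far enough from every bud so far.
        if all(abs(idx - b) > min_gap for b in buds):
            buds.append(idx)
        if len(buds) == n_buds:
            break
    return sorted(buds)


def new_shoot(rng, angle_range=(20.0, 60.0), length_range=(0.1, 0.3),
              radius_range=(0.002, 0.005)):
    """Draw shoot attributes uniformly; all ranges are placeholders."""
    return {
        "angle_deg": rng.uniform(*angle_range),
        "length_m": rng.uniform(*length_range),
        "radius_m": rng.uniform(*radius_range),
    }


rng = random.Random(0)
buds = pick_buds(n_cylinders=40, n_buds=5, rng=rng)
shoots = [new_shoot(rng) for _ in buds]
```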
Regarding the generic growth component, we performed an L-system-like growth simulation (Prusinkiewicz and Lindenmayer, 1996). Each cylinder in the QSM is considered a node in an L-system, which can independently elongate, increase in radius, and die when experiencing competition (Hemmerling et al., 2008). Such growth simulations have been well explored in functional structural plant models (FSPMs), which span different scales and integrate various physiological perspectives (Louarn and Song, 2020). FSPMs can simulate tree growth closer to reality by taking detailed environmental parameters as input, such as lighting and soil water content. However, these can increase the computational load of training the decision-making model. So, in the first version, the generic growth consisted of only three steps: (1) extending the shoots with new cylinders at their tips; (2) increasing the cylinder radius in inverse proportion to the branch order, where the annual DBH increment follows Dervishi et al. (2022); and (3) deleting, with a certain probability, every branch that grows close to another branch of a larger or similar size. This chance of death due to competition increases with the branching order, so the main trunk is more competitive than the sub-branches. Following these three steps, the game produces the future branch state in QSM after growth.
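Steps (2) and (3) depend only on the branch order, which can be sketched with two small functions. The functional forms and constants below (`per_order`, `cap`, and order 1 denoting the trunk) are assumptions chosen for illustration; the text specifies only the direction of each relationship.

```python
def radius_increment(base_increment, branch_order):
    """Radial growth inversely proportional to branch order.

    Order 1 is taken to be the trunk (an assumed convention), so the
    trunk receives the full base increment.
    """
    return base_increment / branch_order


def death_probability(branch_order, per_order=0.1, cap=0.9):
    """Chance that a crowded branch is deleted; rises with branch order.

    The linear form and both constants are placeholders.
    """
    return min(cap, per_order * branch_order)
```

With these placeholder values, a crowded third-order twig is three times as likely to be removed as a crowded trunk segment, reflecting the stated competitive advantage of the main trunk.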
Based on this QSM, the new leaf state in LAD voxel values is estimated (Shu et al., 2024a). The voxel size was 0.8 m on each side to align with the LAD estimation model. The total voxel space consists of 20 × 20 × 26 voxels, so it measures 16 × 16 × 20.8 m. A common urban tree would not exceed this size. It should be noted that LAD values in the voxels are continuous numbers. To give human players a clearer visualization, a visibility threshold of 0.05 m²/m³ was set to emphasize the major crown geometry (see Figure 3b). These results are updated in the user interface.
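The voxel-space arithmetic and the display threshold can be checked with a few lines. This is a minimal sketch in which the LAD field is represented as a dictionary from voxel index to value. That representation, the strict "greater than" comparison, and the example values are assumptions.

```python
VOXEL_SIZE_M = 0.8           # edge length of one voxel, from the text
GRID_SHAPE = (20, 20, 26)    # voxels along x, y, z, from the text
VISIBLE_LAD = 0.05           # display threshold in m^2/m^3, from the text

# Physical extent of the voxel space along each axis, in metres.
extent_m = tuple(round(n * VOXEL_SIZE_M, 1) for n in GRID_SHAPE)


def visible_voxels(lad_by_voxel):
    """Return the voxel indices whose LAD exceeds the display threshold."""
    return {idx for idx, lad in lad_by_voxel.items() if lad > VISIBLE_LAD}


# Made-up LAD values for three voxels.
example_lad = {(10, 10, 5): 0.30, (10, 11, 5): 0.04, (9, 10, 6): 0.05}
shown = visible_voxels(example_lad)
```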
2.1.2 A further simplified version of the tree growth game
In the second round of simplification, the branch primitives in QSM were removed completely to further reduce storage consumption. The leaf primitives were also simplified from LAD values for each voxel to Boolean values. Voxels with true values were solid voxels with leaves, while those with false values were void voxels without leaves. In addition, the initial leaf state and the target in voxels were fixed to a single configuration for all episodes.
The pruning actions remained the same. However, instead of affecting branches, these actions directly delete solid voxels according to a given depth in voxels or a deletion rate (see Appendix Table A1). Without regrowth of shoots after pruning, the causal link between a pruning operation and the deletion of solid voxels was expected to be clearer. The generic growth of the crown was redrafted accordingly: each solid voxel expands by one voxel upward and in each of the four horizontal directions.
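The redrafted growth rule amounts to a one-step dilation of the solid voxels with a five-neighbour stencil. The sketch below stores solid voxels as a set of integer indices. Clipping at the grid boundary and reusing the 20 × 20 × 26 grid of the first version are assumptions.

```python
GRID_SHAPE = (20, 20, 26)  # assumed to match the first game version

# Upward plus the four horizontal neighbours; no downward growth.
GROWTH_OFFSETS = [(0, 0, 1), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]


def grow(solid, shape=GRID_SHAPE):
    """Return the solid voxel set after one round of generic growth."""
    grown = set(solid)
    for x, y, z in solid:
        for dx, dy, dz in GROWTH_OFFSETS:
            nx, ny, nz = x + dx, y + dy, z + dz
            # Ignore neighbours that fall outside the voxel space.
            if 0 <= nx < shape[0] and 0 <= ny < shape[1] and 0 <= nz < shape[2]:
                grown.add((nx, ny, nz))
    return grown


one_round = grow({(10, 10, 5)})
```

Because the rule is deterministic and has no regrowth term, the effect of any deletion on later states can be traced directly, which is the clarity the simplification aims for.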
Finally, the game determines whether to run another iteration or end the episode, depending on trigger conditions. Typical trigger conditions are reaching a certain similarity to the target, reaching the maximum number of rounds, or receiving a stop command from the player.
2.2 Decision-making in the tree growth game with DRL
As shown in Figure 5, in reinforcement learning, a game consists of (1) an agent, representing the player, (2) the state of the agent (noted as $S_t$, where $t$ is the number of game rounds), (3) an environment where the agent operates, (4) actions available to the agent (noted as $A_t$), and (5) rewards or penalties given to the player for completing or failing certain tasks (noted as $R_t$). The final decision is made by selecting the action with the highest Q-value (noted as $Q(S_t, A_t)$), corresponding to the current state $S_t$ and every possible action $A_t$. The tree growth simulation serves as the environment in this system, providing feedback on the subsequent state of the tree based on the current state and the input action. Instead of a human player, a neural network is the agent that estimates the Q-value using the Bellman equation (Watkins and Dayan, 1992) for every possible state-action pair. These estimates are based on the previously received state and reward as input.
Figure 5. Structure of the DRL model in decision-making. The tree growth game (environment) returns the future state $S_{t+1}$, which is compared with the target state $S_{TAR}$ to yield the reward $R(S_{t+1}, S_{TAR})$. The agent receives the reward $R_t$ and the state $S_t - S_{TAR}$ and outputs the action $A_t$ with Q-value $Q(S_t, A_t)$.
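For reference, the Q-value estimate mentioned above follows the standard Q-learning form of the Bellman update (Watkins and Dayan, 1992). The learning rate $\alpha$ and discount factor $\gamma$ are generic symbols of that textbook formulation, and $a'$ ranges over the actions available in the next state:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]
```

In a deep Q-network, the bracketed term is not applied as a table update but serves as the error signal that the network weights are trained to minimize.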
The choice of a specific DRL neural network architecture for training largely depends on the complexity and diversity of the input state and action space. Traditional Deep Q-Network (DQN) algorithms (Mnih et al., 2015) are well suited to problems involving discrete action spaces, for example, moving two paddles either up or down a fixed distance per frame in a game of Pong (Diallo et al., 2017). Deep Deterministic Policy Gradient (DDPG) algorithms (Lillicrap et al., 2019) excel in continuous action spaces, for example, releasing the stones with a certain direction and velocity in a curling game (Lee et al., 2018). However, many real-world applications require operating in a combined discrete-continuous action space. In optimizing tree pruning strategies, decisions must encompass both the selection of a pruning type (discrete) and the specification of intensity parameters, such as distance or depth (continuous). The Parameterized Deep Q-Network (P-DQN) algorithm (Xiong et al., 2018) extends the capabilities of DQN and DDPG by handling such hybrid discrete-continuous action spaces. The core idea behind P-DQN is to use a deep neural network to estimate the Q-values for discrete actions while simultaneously learning the continuous parameters of each discrete action. The system’s objective is for the agent (P-DQN) to learn an optimal policy $\pi$ for sampling an action $a$ from the action set $\mathcal{A}$ defined in expression (1). This set consists of tuples $(k, x_k)$, where each tuple includes a discrete action $k$ selected from a set of discrete actions and a continuous parameter $x_k$ associated with each $k$. A comprehensive overview of the pruning options and their corresponding confidence bounds is given in Appendix Table A1.
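The hybrid decision can be illustrated with a short sketch: a parameter function proposes a continuous value for every discrete action, a Q function scores each resulting pair, and the pair with the highest score is chosen. The two toy functions below stand in for the neural networks and are purely hypothetical; only the selection logic reflects the P-DQN idea described above.

```python
def select_hybrid_action(state, param_net, q_net, n_actions):
    """Pick the (discrete action, continuous parameter) pair with the
    highest estimated Q-value."""
    # One proposed parameter per discrete action.
    params = [param_net(state, k) for k in range(n_actions)]
    # Score every (action, parameter) pair in the current state.
    q_values = [q_net(state, k, params[k]) for k in range(n_actions)]
    best = max(range(n_actions), key=q_values.__getitem__)
    return best, params[best], q_values


# Toy stand-ins for the two sub-networks (not trained models).
def toy_param_net(state, k):
    return 0.1 * (k + 1)


def toy_q_net(state, k, x):
    return state - (x - 0.2) ** 2


action, parameter, scores = select_hybrid_action(
    state=1.0, param_net=toy_param_net, q_net=toy_q_net, n_actions=3)
```

In the pruning game, `action` would index a strategy such as thinning or raising, and `parameter` would be its distance, height, or depth.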
Based on this framework of the DRL model, further details must be specified for (1) the architecture of P-DQN (see “Agent” in Figure 5) and (2) the reward function (see “$R(S_{t+1}, S_{TAR})$” in Figure 5).
The architecture of P-DQN consists of two sub-networks: (1) a dueling network that outputs Q-values for choosing the discrete action $k$, using a Dueling Deep Q-Network (DDQN) (Wang et al., 2016), and (2) a continuous action parameter network that predicts the parameters $x_k$ for each action $k$. Two critical phases in training the P-DQN are exploration and exploitation. During exploration, the agent tries out new actions to observe their effects. This phase is essential for learning the accurate responses of the environment. During exploitation, the agent uses prior knowledge from the policy $\pi$ to choose the actions expected to maximize the reward. To balance these two phases, the P-DQN algorithm employs a strategy named epsilon-greedy: if the agent selects a random action with probability $\epsilon$ (exploration), it selects the action with the highest Q-value with probability $1 - \epsilon$ (exploitation). This ensures the agent does not get stuck in suboptimal policies and continues searching for improvements. In training the P-DQN, prior experiences must be stored in a replay buffer, including all past states, actions, rewards, and next states. This consumes a large amount of RAM. During training, mini-batches of these prior data are sampled from the replay buffer to calculate the Q-value. The Q-value is then used to compute a least-squares loss function, one of the most common loss functions in DQN. Minimizing this loss eventually drives the P-DQN network to “learn” a better policy $\pi$.
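The epsilon-greedy rule described above fits in a few lines. This is a generic sketch of the technique for the discrete part of the action only. The function name and the way randomness is injected are illustrative choices, not taken from the published code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action index with probability epsilon,
    otherwise the index of the highest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploitation


rng = random.Random(42)
q_estimates = [0.1, 0.7, 0.3]
greedy_choice = epsilon_greedy(q_estimates, epsilon=0.0, rng=rng)
random_choices = [epsilon_greedy(q_estimates, epsilon=1.0, rng=rng)
                  for _ in range(100)]
```

Setting `epsilon` close to 0 biases the agent toward exploitation, while values close to 1 make it behave almost randomly.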
The general principle for the reward function is to reflect how close the tree state $S_t$ is to the target state $S_{TAR}$. The Intersection over Union (IoU) index (Li et al., 2021) was used to quantify this similarity. It describes the proportion of voxels common to the two voxel sets $S_t$ and $S_{TAR}$ within their union, i.e., $|S_t \cap S_{TAR}| / |S_t \cup S_{TAR}|$. So that the agent could see both sets of voxel states, the state used for training the P-DQN at round $t$ is calculated as $S_t - S_{TAR}$.
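The IoU of two voxel sets can be computed directly with set operations. The following sketch represents each state as a set of occupied voxel indices. Returning 0 for two empty sets is a convention chosen here, not one stated in the text.

```python
def iou(current, target):
    """Intersection over Union of two sets of occupied voxel indices."""
    union = current | target
    if not union:
        return 0.0  # convention for two empty sets (an assumption)
    return len(current & target) / len(union)


current_state = {(0, 0, 0), (1, 0, 0)}
target_state = {(1, 0, 0), (2, 0, 0)}
similarity = iou(current_state, target_state)  # one shared voxel out of three
```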
Using all the settings introduced above, the DRL model and the tree growth game were run on a virtual machine equipped with eight virtual CPUs, a GPU t6435.nvidia-v100.1, and 32 GB RAM. Due to limitations in computational resources, the batch size and network dimensions were optimized to reduce RAM occupation (see section 2.3). Despite these optimizations, computational constraints still limited the number of training episodes that could be stably executed. Given the insufficient number of training episodes in the initial tree growth simulation (see section 4.1), we developed a further simplified version of the tree growth game (see section 2.1.2) to gain a deeper understanding of the novel workflow’s effects and limitations (see section 4.2).
2.3 Further implementation details
An upper limit on the number of rounds was set for a single episode. In the first trained game version, an episode was terminated after 20 rounds of decision-making and tree growth, and the tree states were reset. The batch size was set to only 64 due to RAM constraints. The dimensions of the critic and actor hidden layers were tuned to 256 × 128 × 64 to balance overfitting and underfitting. Initially, random actions were selected for the first 64 batches to establish a baseline for Q-value estimation. Afterward, the P-DQN network was trained using the prior actions and began choosing actions, although random actions were still selected periodically. The discount factor gamma was set to 0.99 to calculate the Q-value using the Bellman equation. This parameter configuration biases the agent towards the exploitation phase, favoring actions determined by the P-DQN network while allowing for some degree of exploration through random action selection. The learning rate was tuned from 0.001 (Xiong et al., 2018) to 1e−4 to ensure stable updates. In addition, an unrealistic action parameter could chop off the whole tree. To handle this case, if the total number of cylinders in the QSM state fell below 20, the episode terminated immediately with a negative reward of −1. Finally, the IoU scores were calculated only for voxels with LAD values larger than 0.05 m²/m³. A penalty of −0.2 points was applied to every round after round 10 to encourage the model to approach the target in earlier rounds. The effectiveness of the reward and penalty system is discussed in section 4.2.
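The penalty and termination rules above can be summarized in one function. This is only a sketch of how they might be combined: the text gives the constants (round 10, −0.2, 20 cylinders, −1) but not how the IoU is scaled into the displayed score or the exact order of the checks, so both are assumptions here, as is the function name.

```python
def round_reward(iou_score, round_index, n_cylinders,
                 late_round=10, late_penalty=-0.2,
                 min_cylinders=20, failure_reward=-1.0):
    """Return (reward, terminated) for one round of the first game version.

    The IoU is used unscaled, which is an assumption; the published
    score may apply a different scaling.
    """
    # Over-pruning: too few cylinders left, so the episode ends at once.
    if n_cylinders < min_cylinders:
        return failure_reward, True
    reward = iou_score
    # Rounds after `late_round` are penalized to favour early convergence.
    if round_index > late_round:
        reward += late_penalty
    return reward, False


early = round_reward(0.5, round_index=5, n_cylinders=100)
late = round_reward(0.5, round_index=11, n_cylinders=100)
chopped = round_reward(0.9, round_index=3, n_cylinders=10)
```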
In the further simplified game version with binary voxels, the maximum round limit for each episode was increased to 30. The batch size was increased to 512, which the lower RAM usage of the simplified game made possible. The dimensions of the critic and actor hidden layers were accordingly extended to 256 × 128 × 64 × 32, and convolutional neural network (CNN) layers were used as feature extractors in the hidden layers. The discount factor gamma was kept at 0.99 without encountering any issues. The learning rate was further tuned to 1e−5 to reduce loss fluctuations and stabilize training. An upper limit of 2000 episodes was set, which was not reached (see section 3.2). For bad actions that deleted all the solid voxels, the penalty was increased to −100 to be comparable with the cumulative rewards within 30 rounds. The state fed to the P-DQN network in the simplified game was exactly the binary voxels of the current leaf areas.
During training, the agent stores experience tuples (each consisting of the current state, selected action, received reward, and next state) in a replay memory implemented as a queue. This memory retains up to 1e6 tuples, discarding the oldest entries as new data are added. To improve learning stability, the agent updates its parameters twice per step by randomly sampling from stored experiences. This practice is widely accepted for mitigating the risk of overfitting to specific experience sequences (Mnih et al., 2015). The source code for the game and the DRL model, including these detailed settings, is available on our GitHub page (Shu and Boey, 2024).
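A bounded queue with uniform random sampling is enough to reproduce the described behaviour. The class below is a minimal sketch built on Python's `collections.deque`. The class and method names are illustrative and are not taken from the published source code.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity experience store that drops its oldest entries."""

    def __init__(self, capacity=1_000_000):
        # A deque with maxlen discards items from the left when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly at random, without replacement."""
        indices = rng.sample(range(len(self.buffer)), batch_size)
        return [self.buffer[i] for i in indices]

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.push(state=step, action=0, reward=0.0, next_state=step + 1)
batch = memory.sample(2, rng=random.Random(0))
```

Sampling at random rather than replaying experiences in order breaks the correlation between consecutive rounds of one episode, which is the stabilizing effect referred to above.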
3. Results
3.1 The model trained on the first version of the tree growth game
The training process on the first game version reached 60 episodes. For every episode, the total rounds, consumed time, and frequency of chosen actions are shown in Appendix Figure A1. The first 7 episodes used random actions. From the 8th episode onwards, the P-DQN network took over the decisions on actions. With these decisions, more than half of the episodes lasted longer than 15 rounds (see Appendix Figure A1a), which means the parameters of those actions were, at least, not chopping the whole tree off. This was also indirectly supported by the time consumption (see Appendix Figure A1b): the longer growth simulation time in later episodes indicated that the tree had grown larger, with more cylinders.
The actions most often chosen by the P-DQN network shifted from Raising (action 1) to Thinning (action 0) and Reduction from the south (action 3) (see Appendix Figure A1c). Among these 3 most used actions (see Appendix Figure A1d), thinning was carried out very gently in most cases, with a minimum distance threshold for branches of less than 5 mm. In contrast, raising the crown was carried out much more intensively, with a raised height of mostly 9–10 meters. The reduction from the south was carried out with both short and long distances.
Figure 6 illustrates the changes in the branches and leaves of the tree in episode 55. Round 4 of this episode achieved the highest reward of all episodes. Each round is annotated with its corresponding score and action decision. Most thinning decisions (action 0) used a small parameter. They generally encouraged the free growth of the crown, which mostly led to an increasing score. However, the two raising decisions (action 1 at rounds 5 and 9) and the two reductions from the south (action 3 at rounds 4 and 12) cut too many branches, reducing the IoU score significantly.
The two horizontal sections show sequential rounds of a tree growth and reinforcement learning process. Each round contains two three-dimensional coordinate plots. The upper plot in each round is labeled “Branch State in Q S M” and displays a detailed branch structure of a tree. The lower plot is labeled “Leaf State in L A D” and displays a voxelized green canopy structure. Below each pair of plots are corresponding “Score”, “Action”, and “Parameter” values. The final column on the lower-right is labeled “Target” and shows the target voxel canopy configuration. The top section displays rounds 0 through 7. Round 0: The branch structure is compact with a small canopy. The leaf voxel structure forms a rounded compact cluster. Score: 6.43. Round 1: The branch canopy becomes denser and slightly larger. The voxel canopy expands upward and outward. Score: 10.03. Round 2: The branch structure further increases in density and width. The voxel canopy becomes larger and more spherical. Score: 13.56. Round 3: The branch structure grows slightly taller and fuller. The voxel canopy enlarges further with greater density. Score: 15.33. Round 4: The branch structure reaches one of the densest canopy states in the sequence. The voxel canopy appears broad and compact. Score: 19.10. Round 5: The branch structure becomes narrow with a tall trunk-like form and sparse upper branches. The voxel canopy changes into a vertically elongated cluster. Score: 6.04. Round 6: The branch structure remains slender and vertical with sparse branching. The voxel canopy becomes thinner and more irregular. Score: 5.62. Round 7: The branch structure becomes slightly fuller than the previous round while remaining vertically narrow. The voxel canopy becomes taller and denser. Score: 10.07. The lower section displays rounds 8 through 14 and the target state. Round 8: The branch structure is vertically elongated with sparse branches. The voxel canopy forms a tall, dense column. Score: 12.27. 
Round 9: The branch structure becomes denser and slightly wider. The voxel canopy expands upward and outward. Score: 14.19. Round 10: The branch structure shifts into a bent upper canopy form. The voxel canopy becomes asymmetrical with an enlarged upper section. Score: 6.20. Round 11: The branch structure remains asymmetrical and sparse. The voxel canopy becomes irregular with separated lower voxels. Score: 4.71. Round 12: The branch structure thickens slightly and extends upward. The voxel canopy becomes denser with a broader upper section. Score: 8.39. Round 13: The branch structure becomes narrow and tall again. The voxel canopy forms a vertical elongated mass with a side extension near the top. Score: 6.42. Round 14: The branch structure remains vertically narrow with moderate branching. The voxel canopy becomes irregular and asymmetrical with detached lower sections. Score: 4.57. The “Target” panel displays the desired voxel canopy state as a dense, horizontally spread green voxel cluster occupying a compact low-height region within the coordinate space. The Action and Parameter are labeled below a rightward curved arrow between two consecutive plots as follows: Between rounds 0 and 1: Action: 0. Parameter: 0.016. Between rounds 1 and 2: Action: 0. Parameter: 0.037. Between rounds 2 and 3: Action: 0. Parameter: 0.016. Between rounds 3 and 4: Action: 0. Parameter: 0.015. Between rounds 4 and 5: Action: 3. Parameter: 8.93. Between rounds 5 and 6: Action: 1. Parameter: 9.88. Between rounds 6 and 7: Action: 0. Parameter: 0.047. Between rounds 7 and 8: Action: 0. Parameter: 0.008. Between rounds 8 and 9: Action: 0. Parameter: 0.003. Between rounds 9 and 10: Action: 1. Parameter: 9.30. Between rounds 10 and 11: Action: 0. Parameter: 0.019. Between rounds 11 and 12: Action: 0. Parameter: 0.029. Between rounds 12 and 13: Action: 3. Parameter: 9.34. Between rounds 13 and 14: Action: 0. Parameter: 0.011. Between rounds 14 and “Target”: Action: 0. 
Parameter: 0.016 below a dashed curved arrow. Curved arrows between consecutive rounds indicate the transition sequence from one round to the next. Dashed arrows near the final rounds indicate progression toward the target state.
Rendered records of the tree growth game at Episode 55, with the maximum reward at round 4
Overall (see Figure 7), the maximum reward per episode, the average reward, and the cumulative reward all increase with the number of training episodes. The cumulative and maximum rewards climb faster than the average reward. However, no stable pruning strategy emerged within the 60 episodes, so the further development of these rewards could not be observed.
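The trend lines in Figure 7 are described as linear regressions of reward against episode number. A minimal sketch of such a fit follows, assuming an ordinary least-squares fit with the episode index as the independent variable. The function name `linear_trend` and the reward values in the usage lines are made up for illustration and are not the study's data.

```python
def linear_trend(values):
    """Fit values[i] ~ slope * i + intercept by ordinary least squares."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative (made-up) per-episode rewards, not the study's data.
example_rewards = [1.0, 3.0, 2.0, 5.0, 4.0]
slope, intercept = linear_trend(example_rewards)
```

A positive slope indicates that the reward tends to rise with training, even when individual episodes fluctuate strongly, as they do in Figure 7.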
The chart titled “Reward by Episode” displays reward values across training episodes using three line graphs and their corresponding linear regression trend lines. The horizontal axis is labeled “Episode” and ranges from 0 to 60 in increments of 5. The left vertical axis is labeled “Reward in Single Round” and ranges from 0 to 20 in increments of 2. The right vertical axis is labeled “Cumulative Reward” and ranges from 0 to 200 in increments of 20. Three primary data series are shown. A green line with circular markers represents “Max Reward per Round”. A blue line with square markers represents “Average Reward per Round”. An orange line with triangular markers represents “Cumulative Reward”. Dashed trend lines of matching colors indicate linear regression trends for each reward type. The green “Max Reward per Round” series fluctuates strongly across the episodes. It begins near 0 at episode 0, rises above 5 by episode 2, and continues with multiple peaks and drops throughout the chart. Major peaks occur near episode 12 at about 12.5, episode 20 at about 17.8, episode 33 at about 12.6, episode 44 at about 15.8, episode 46 at about 15.9, episode 48 at about 15.5, episode 51 at about 15.5, episode 54 at about 15.7, and episode 55 at 19.0, which is the highest value in the series. Several episodes such as 0, 4, 23, 28, 45, 49, and 56 show values near 0. The green dashed regression line shows a gradual upward trend from about 4.5 at episode 0 to about 9.5 at episode 60. The blue “Average Reward per Round” series varies within a narrower range. It begins near 0 at episode 0 and mostly fluctuates between 1 and 6. Higher values occur near episode 20 at about 9.6, episode 44 at about 9.8, episode 48 at about 10.0, and episode 55 at about 9.0. Several dips to 0 appear near episodes 0, 23, 28, 45, 49, and 56. The blue dashed regression line shows a modest upward trend from about 2.3 at episode 0 to about 4.3 at episode 60. 
The orange “Cumulative Reward” series displays the largest fluctuations and corresponds to the right vertical axis. It begins near 0 at episode 0 and varies widely across the chart. Moderate peaks occur near episode 11 at about 70, episode 31 at about 115, and episode 35 at about 120. The highest peaks occur near episode 44 at approximately 195 and episode 48 at 200. Additional large peaks occur near episode 46 at about 150 and episode 55 at about 135. Multiple episodes show cumulative rewards close to 0, including episodes 0, 4, 23, 24, 28, 29, 45, 49, and 56. The orange dashed regression line trends upward gradually from about 25 at episode 0 to about 68 at episode 60. Three legends appear on the right side of the chart, identifying the solid data lines, the cumulative reward series, and the dashed regression lines, respectively. Note: All numerical data values are approximated.
The trend of the gained reward using linear regression
3.2 The model trained on the further simplified binary game
Training on the binary tree growth game readily reached 1,000 episodes. The first 175 episodes used random actions, while the P-DQN network decided the actions in later episodes. A stable pruning strategy was achieved after 700 episodes (see Figure 8). Statistics on time consumption and action frequency are shown in Appendix Figure A2. Most episodes with 30 rounds finished within 10 s (see Appendix Figure A2b). The most frequently chosen action shifted from reduction from the south to topping at around episodes 200 to 300. Afterward, thinning dominated the pruning decisions (see Appendix Figure A2c). Among these three most used actions, thinning (action 0) was mostly applied gently, with some cases of medium intensity. Reduction from the south (action 3) and topping (action 7) mainly used a short distance or depth (see Appendix Figure A2d).
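The schedule described above, with random actions during the first 175 episodes and network decisions afterward, can be sketched as follows. This is a simplified illustration rather than the released implementation. `propose_parameter` and `q_value` are hypothetical stand-ins for the trained parameter and Q networks, the post-warm-up choice is purely greedy (no exploration noise), and only the two actions whose binary-version ranges appear in the Appendix table are included.

```python
import random

WARMUP_EPISODES = 175  # episodes played with purely random actions (from the text)

# Parameter ranges for two binary-game actions, taken from the Appendix table.
PARAM_RANGES = {
    0: ("float", 0.0, 0.5),  # thinning: rate of solid voxels to delete, in [0, 0.5)
    1: ("int", 1, 10),       # raising: voxel layers to delete from the bottom
}


def random_action(rng):
    """Sample a discrete action, then a parameter uniformly from its range."""
    action = rng.choice(sorted(PARAM_RANGES))
    kind, low, high = PARAM_RANGES[action]
    if kind == "int":
        param = rng.randint(low, high)                 # inclusive on both ends
    else:
        param = low + (high - low) * rng.random()      # half-open [low, high)
    return action, param


def greedy_action(state, propose_parameter, q_value):
    """P-DQN-style choice: one proposed parameter per action, then argmax over Q."""
    candidates = [(a, propose_parameter(state, a)) for a in sorted(PARAM_RANGES)]
    return max(candidates, key=lambda ap: q_value(state, ap[0], ap[1]))


def select_action(episode, state, propose_parameter, q_value, rng=random):
    """Random actions during warm-up, network-driven actions afterward."""
    if episode < WARMUP_EPISODES:
        return random_action(rng)
    return greedy_action(state, propose_parameter, q_value)
```

The returned pair (action number, parameter) mirrors the “Action” and “Parameter” values recorded for each round of the game.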
The chart titled “Reward by Episode” displays reward values across one thousand training episodes using three line graphs and their corresponding linear regression trend lines. The horizontal axis is labeled “Episode” and ranges from 0 to 1000 in increments of 100. The left vertical axis is labeled “Reward in Single Round” and ranges from negative 100 to 50 in increments of 10. The right vertical axis is labeled “Cumulative Reward” and ranges from negative 100 to 550 in increments of 50. Three primary data series are shown. A green line with circular markers represents “Max Reward per Round”. A blue line with square markers represents “Average Reward per Round”. An orange line with triangular markers represents “Cumulative Reward”. Dashed trend lines of matching colors indicate linear regression trends for each reward type. The green “Max Reward per Round” series begins near negative 100 during the early episodes and fluctuates strongly. Between episodes 0 and 180, many values remain between negative 100 and 0. After episode 180, the series rises sharply and stabilizes mostly between 10 and 20. Several prominent peaks occur near episodes 270 and 370 at 40 and 42, respectively. Additional peaks between 20 and 30 appear throughout episodes 450 to 800. The green dashed regression line shows a steady upward trend from about negative 15 near episode 0 to 30 near episode 1000. The blue “Average Reward per Round” series also begins near negative 100 and shows large fluctuations during the early episodes. From episode 0 to 180, many values remain between negative 100 and 0. After episode 180, the average reward gradually improves and fluctuates mostly between 0 and 10. Occasional drops to negative 100 continue near episodes 400, 600, and 650. The highest average reward values occur near episodes 260 and 380 at 18. The blue dashed regression line increases gradually from about negative 20 near episode 0 to 20 near episode 1000. 
The orange “Cumulative Reward” series corresponds to the right vertical axis and shows the largest variation. During the early episodes, many cumulative reward values remain near negative 100. Between episodes 200 and 500, the series fluctuates heavily between negative 100 and positive 250. Large peaks occur near episode 270 at 530 and near episode 380 at 550, which is the highest cumulative reward value in the chart. After episode 500, the cumulative reward stabilizes mostly between 200 and 250 with smaller fluctuations. The orange dashed regression line rises steadily from near 0 at episode 0 to 300 at episode 1000. Three legends appear on the right side of the chart, identifying the solid data lines, the cumulative reward series, and the dashed regression lines, respectively. Note: All numerical data values are approximated.
The trend of the gained reward in the simplified tree growth game with binary voxels
The average and maximum reward per round and the cumulative reward per episode are shown in Figure 8. The P-DQN network rapidly learned to gain positive rewards after the 175 episodes with random actions. By the end of training, the reward had stabilized at about 10 points per round on average. The network had found the “best” pruning strategy for winning the highest reward in the binary tree growth game. The process in episode 800 is visualized in Appendix Figure A3. This record shows that the strategy amounted to an almost free expansion of the crown. Even when solid voxels occupied almost the whole voxel space, the reward, dominated by the IoU score, remained high. This made the decision-making model too “lazy” to use other actions to approach the target crown geometry more precisely.
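A small worked example may help show why free expansion can still score well under an IoU-based reward. The sketch below computes IoU (intersection over union) on sets of solid voxel coordinates. The grid size, target, and crown shapes are hypothetical values chosen for illustration and do not reproduce the study's setup or its full reward function.

```python
def iou(state, target):
    """Intersection over union of two sets of solid voxel coordinates."""
    union = state | target
    if not union:
        return 0.0
    return len(state & target) / len(union)


def box(nx, ny, nz):
    """All voxel coordinates in an nx * ny * nz block anchored at the origin."""
    return {(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)}


# Hypothetical sizes for illustration only.
space = box(10, 10, 10)      # the whole voxel space (1000 voxels)
target = box(10, 10, 6)      # a target crown filling 600 of them
small_crown = box(4, 4, 4)   # an early, compact crown (64 voxels inside the target)

iou_full = iou(space, target)          # crown has filled the whole space: 600 / 1000
iou_small = iou(small_crown, target)   # crown is still small: 64 / 600
```

With these assumed numbers, a crown that fills the entire space scores 0.6, well above the compact crown's score of roughly 0.11. Whenever the target occupies a large share of the voxel space, an agent rewarded mainly by IoU therefore has little incentive to prune.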
4. Discussion
4.1 Evaluation of the trained model in the first game version
Based on the limited number of training episodes in the first simplified game, we can preliminarily conclude that the P-DQN network gradually learned to avoid the penalty reward of −1. In other words, it learned not to terminate the tree growth game by reducing the number of cylinders below 20. The climbing scores and increasing simulation time in the results support this interpretation.
However, in its specific action selection, the network does not appear to have recognized the relationships among the action, its parameter, the change in the tree state, and the reward. If a human player were handling the tree state in round 4 of episode 55, a better decision might be a reduction from the top to stop the growth in height while encouraging horizontal expansion of the crown. The trained P-DQN network instead cut away almost all the side branches and settled for a smaller score. It made a similar choice at round 12 and failed to explore other actions and parameters that could win higher rewards.
There could be multiple reasons for this behavior of the P-DQN network. (1) The major deficiency was the lack of episodes for exploring different combinations of actions and parameters under random settings, especially for tree states with a larger crown. Random parameters had a high chance of killing the tree and terminating the game while the tree was small, whereas high rewards were only possible later in the game, once the crown had grown large. Tree states with large crowns occurred only when all the random actions happened, by chance, to preserve the tree growth. Therefore, a large number of trials was required for the network to learn the effects of different actions on large trees. In addition, the frequencies of actions and parameters shown in Appendix Figure A1 were unbalanced. These distributions indicate that the tree states and action combinations explored during training were insufficiently diverse. Evidently, the current trials did not make the network “experienced” enough with the huge action space under different tree states. (2) The clarity of rewards and penalties is another key factor in training performance. The IoU index used in the experiments gave a positive reward based on the percentage of overlap between the tree state and the target voxels. Even when the state after an action overlapped less than the previous state, the reward was still positive, only smaller. This may have made it difficult to identify truly good actions. (3) The balance between exploitation and exploration set by the gamma value also affected the results. It determined whether the model relied more on familiar paths or took risks in exploring new actions. Setting gamma to 0.99 may have reinforced the P-DQN network's tendency to take actions and parameters that were tried by chance in the random settings and yielded smaller rewards, making it too conservative to explore new actions and parameters.
The detailed design choices regarding rewards, penalties for suboptimal actions, and round limits above were determined based on the authors' expertise, complemented by iterative adjustments through multiple test runs. These revisions aimed to balance the learning process and ensure stable training dynamics. However, the current reward system may not represent its most optimized form. Further refinements and potential optimization strategies are discussed in Section 4.2.
4.2 Reflections on the results from the further simplified versions of games
The binary version of the game demonstrated two valuable abilities of the P-DQN network: (1) it can handle a large hybrid action space consisting of discrete operation types and a continuous parameter describing the intensity of the operation; (2) it can effectively find operations that avoid penalties and gain relatively high rewards. At the same time, this work revealed the following two obstacles that must be overcome before the approach can be applied in industry.
Due to computational constraints in this study, training a P-DQN network on a physiologically based tree growth simulation was not feasible. Instead, our experiments relied on a simplified tree growth simulation using binary voxels. While this approach significantly reduced computational demands, it does not fully capture the complexity of real tree growth dynamics. The binary voxel representation lacks physiological accuracy, meaning the simulated growth outcomes may not directly reflect real-world tree responses to pruning.
The objective of achieving a target leaf area remains ambiguous compared to well-defined tasks in other reinforcement learning applications, such as catching a ping-pong ball (Diallo et al., 2017). This ambiguity makes it challenging to establish a precise reward index that effectively guides tree growth toward a desired crown geometry. As a result, the trained model may achieve high rewards through unintended or suboptimal strategies rather than by following biologically meaningful growth patterns. To improve the reward design, future studies could explore methodologies such as curriculum learning (Portelas et al., 2020), in which the learning process is structured so that task difficulty increases gradually. Additionally, reward-shaping techniques (Viswanadhapalli et al., 2024) could be used to modify the rewards so that they better align with the long-term objective of crown formation.
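One widely used form of reward shaping is potential-based shaping, which adds the term γΦ(s′) − Φ(s) to the environment reward, where Φ is a potential function over states and γ is the discount factor. The sketch below uses the IoU with the target as the potential. This choice is an assumption for illustration only: it is neither the reward used in this study nor necessarily the formulation of the cited work, and the function name and IoU values are hypothetical.

```python
def shaped_reward(base_reward, potential_before, potential_after, gamma=0.99):
    """Add the potential-based shaping term gamma * phi(s') - phi(s)."""
    return base_reward + gamma * potential_after - potential_before


# Hypothetical IoU values before and after a pruning action.
worse = shaped_reward(0.0, potential_before=0.50, potential_after=0.40)
better = shaped_reward(0.0, potential_before=0.40, potential_after=0.50)
```

Under this scheme, an action that lowers the overlap with the target yields a negative signal, whereas an IoU-only reward would still be positive for such an action.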
4.3 Visions of this study
Future hardware developments are expected to ease the first obstacle regarding limited computational power. The second obstacle requires optimizing the reward system, the state description, or the network structure. We have open-sourced the tree growth game and invite researchers and other experts in reinforcement learning to test other parameters, reward indexes, and network structures for training a DRL model to play it.
Nevertheless, the configured P-DQN network proved to be a feasible technical route for decision-making in tree pruning. It can serve as a starting point for adding other tree-management decisions in the future.
Moreover, the tree growth games have proven flexible. They can be either simplified to reduce the computational load or deepened in certain aspects to address new features and boundary conditions, such as root distribution and water content. In this way, the proposed framework combining a growth game and a DRL network can, in the future, also integrate new physiological models for simulating tree growth and richer data formats for describing tree states.
Finally, transitioning from single-tree decision-making to urban-scale green infrastructure management would necessitate more efficient algorithms or cloud-based distributed computing solutions (Popović et al., 2018; Rashid et al., 2018). As this transition also involves a substantial increase in data volume, this accumulation may paradoxically reduce computational costs and requirements. By leveraging empirical and tacit knowledge such as knowledge graphs (Wu, 2024) or advanced models like DeepSeek (DeepSeek-AI et al., 2024), patterns and relationships within the data can be systematically captured, enabling more efficient decision-making. Similar approaches may facilitate scalability for large-scale green infrastructure management in the future.
At that stage, real-time or simulated pruning decisions can be integrated into digital twins, enabling urban planners to monitor and optimize the long-term impacts of tree growth on ecosystem services using voxel-based representations. An arborist could receive real-time updates on target crown adjustments in voxel representations using a tablet or AR glasses. The decision-support tool would provide a visual guide for shaping the tree and estimate the number of years required to achieve the target geometry. Additionally, it could generate an optimized pruning schedule, specifying which branches should be removed in each phase, the appropriate tools required, and the estimated labor investment needed for execution. Such precise information and intuitive visualization enhance data-driven urban forestry management and foster public engagement in green infrastructure planning within broader smart city frameworks.
5. Conclusion
In the traditional workflow for designing and managing urban trees, designers draft a fixed tree size on paper before implementation, whereas tree management specialists work hands-on over a much longer period. Pruning decisions are made without clear design intentions regarding crown shape. This workflow cannot address the multifunctional use of urban trees, such as offering larger shaded areas to confront global warming. Therefore, we proposed a workflow in which designers set leaf area targets based on different boundary conditions. A decision-support algorithm then advises tree management specialists, especially on pruning, to guide tree growth toward the design target.
Existing tools support the proposed workflow, including digital tree twins, LiDAR scanning, and QSM extraction. Tree growth simulation is feasible with FSPMs and resprouting predictions, while voxels serve as geometric primitives for defining leaf area targets. What is still missing is a decision-support mechanism that guides tree growth toward the design target. Therefore, this paper explored the feasibility of DRL in decision-making for tree management to complete this workflow.
A novel framework was developed to train a DRL network for this purpose. It consists of a tree growth game and a decision-making model. The tree growth game served as the simulation environment. It predicted the future state of a tree based on its reaction to stimuli and related growth simulations, which largely determined the reward score. A P-DQN network was the decision-making agent. It received from the game the states of the tree and the corresponding rewards, which were largely determined by the IoU index. Based on prior experience, Q-values were estimated to indicate which action to take in which tree state to win a higher reward. The action space was restricted to four classical pruning practices: thinning, raising, reduction, and topping. Each pruning practice is described by an additional parameter specifying its intensity, such as depth or distance. By playing the tree growth game iteratively, the P-DQN network was expected to accumulate experience and attain a smart pruning strategy.
With the RAM available for storing the training data, the first trained model completed only 60 episodes of the game. Diverse tree states and action combinations were insufficiently explored. Within these limits, the increasing trend of the reward was clear. However, a look at the specific actions taken in different states shows that “smarter” decisions could have yielded even higher rewards, and the P-DQN network could not fully explore these possibilities or recognize their benefits. To overcome this obstacle, the tree growth game was further simplified. Binary voxels were used to describe the occupancy of the leaves. Solid leaf voxels expanded into the surrounding space in each round and could be deleted by pruning actions. There was no longer a complex tree growth simulation. With this binary version of the game, we could train the P-DQN network for 1,000 episodes and reach a stable pruning strategy. This model effectively learned to avoid the penalty for chopping off the tree and found a good “trick” for getting a high reward: allowing the canopy to grow freely and occupy the whole voxel space. Unfortunately, winning a high reward in this way was not the purpose of developing the model.
Our experiments indicate that the P-DQN network has two valuable abilities in decision-making for tree management: (1) it makes decisions in a hybrid action space, choosing both a discrete operation type and an additional continuous parameter describing its intensity; (2) it effectively rules out actions that cause penalties and increases the reward little by little, even in such a large decision space and complex environment. Besides these strengths, we also found two obstacles that must be overcome before this model can be applied in industry. (1) Computational resources were insufficient to train the decision-support model for enough episodes in a close-to-real tree growth simulation environment. This can probably be solved by hardware development, more efficient algorithms, or distributed computing solutions. The future accumulation of urban tree data may also paradoxically reduce computational costs and requirements by leveraging empirical and tacit knowledge. (2) A clear reward index is needed to achieve the desired model outcome. Without it, the P-DQN network learned to win a good score but did not bring the tree crowns close to the design target. This problem may be solvable by curriculum learning or reward-shaping techniques. Testing these alternatives requires interdisciplinary cooperation. Therefore, we released our source code for open access to encourage further tests by other researchers and experts.
Nevertheless, this work has established a technical route for realizing the proposed workflow, in which a computational approach supports complex decisions about pruning positions on branches to guide tree growth toward a design goal. Beyond its technical contributions, the voxel-based approach for setting design targets provides an intuitive representation of urban trees within digital twins, facilitating collaboration among urban planners, arborists, municipal authorities, and even the public. This collaborative approach is crucial for optimizing long-term green infrastructure management while enhancing various ecosystem services, including urban cooling, stormwater regulation, and air quality improvement. In this regard, this study lays the foundation for the multifunctional use of urban trees, aligning with global challenges and contributing to the development of resilient and climate-adaptive urban environments.
Erratum: It has come to the attention of the publisher that the article, Shu, Q., Boey, K.Z. and Ludwig, F. (2025), “Reinforcement learning-driven decision support for target-oriented branch pruning on urban trees”, Smart and Sustainable Built Environment, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/SASBE-10-2024-0427 originally published Figures A1, A2 and A3 in the ‘Figures’ section and Figures 1, 2 and 3 in the ‘Appendix’ section in the online version. This has now been rectified. The publisher sincerely apologises for this error and for any inconvenience caused.
We thank Arne Hingst and Gehard Schubert for coordinating the use of the LRZ cluster computer. This resource was crucial in training the proposed reinforcement learning networks. We would also like to express our gratitude to our research partners in the DFG-DACH project: Halil Erdal, Thomas Rötzer, Astrid Reischl, Hans Pretzsch, Michael Hensel, Jakub Marcin, and Aljbin Ahmeti.
Conflicts of interest: The authors declare no conflict of interest.
Funding: This study was funded by the DFG-DACH project named Urban Green System 4.0 (grant number LU2505/2-1 AOBJ:683826 42).
Data availability: The data underlying this article are available in [Branch-Pruning-Game-on-Urban-Trees], at https://github.com/QiguanShu/Branch-Pruning-Game-on-Urban-Trees.
References
Appendix
1. Definitions for the action space
Action space for agents to decide in the tree growth game and its simplified binary version
| Action no. | Pruning type | Range | Meaning of the parameter | Range (in binary version) | Meaning of the parameter (in the binary version) |
|---|---|---|---|---|---|
| 0 | Thinning | (float) [0, 0.05] | Branches with a distance to any other branch below this number in meters will be cut | (float) [0, 0.5) | The rate of solid voxels to be deleted |
| 1 | Raising | (float) [0, 10] | Branches within this height in meters from the crown start will be cut | (integer) [1, 10] | The number of solid voxel layers to be deleted from the bottom |
| 2 | Reduction east | (float) [0, 10] | Branches within this distance in meters from the crown’s outreach on the east side will be cut | (integer) [1, 10] | The number of solid voxel layers to be deleted from the east |
| 3 | Reduction south | (float) [0, 10] | Branches within this distance in meters from the crown’s outreach on the south side will be cut | (integer) [1, 10] | The number of solid voxel layers to be deleted from the south |
| 4 | Reduction west | (float) [0, 10] | Branches within this distance in meters from the crown’s outreach on the west side will be cut | (integer) [1, 10] | The number of solid voxel layers to be deleted from the west |
| 5 | Reduction north | (float) [0, 10] | Branches within this distance in meters from the crown’s outreach on the north side will be cut | (integer) [1, 10] | The number of solid voxel layers to be deleted from the north |
| 6 | Reduction top | (float) [0, 5] | Branches within this distance in meters below the crown’s top will be cut | (integer) [1, 10] | The number of solid voxel layers to be deleted from the top |
| 7 | Topping | (integer) [0, 5] | Cylinders within this number from the end of a branch will be cut | (float) (0, 5] | Delete voxels whose distance from their center to the mean center of all solid voxels is among the furthest ones within this range in meters |
| 8 | No action (only with a generic growth) | – | Do not conduct any manual pruning | – | Do not conduct any manual pruning |
Note(s): The parameters have different meanings in the binary version of the game
Source(s): Authors’ own work
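The action table above can be read as a small lookup structure. The following Python sketch, a hypothetical encoding rather than the authors' implementation, stores the parameter ranges of the full (non-binary) game and clamps a raw network output into the valid range of the selected action. The names `PruningAction`, `ACTIONS`, and `make_action` are assumptions introduced for illustration; only the action numbers, pruning types, and ranges come from the table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class PruningAction:
    name: str
    low: Optional[float]    # None when the action takes no parameter
    high: Optional[float]
    integer: bool = False   # True when the parameter is an integer count


# Index in this tuple corresponds to "Action no." in the table (full game).
ACTIONS = (
    PruningAction("Thinning", 0.0, 0.05),
    PruningAction("Raising", 0.0, 10.0),
    PruningAction("Reduction east", 0.0, 10.0),
    PruningAction("Reduction south", 0.0, 10.0),
    PruningAction("Reduction west", 0.0, 10.0),
    PruningAction("Reduction north", 0.0, 10.0),
    PruningAction("Reduction top", 0.0, 5.0),
    PruningAction("Topping", 0.0, 5.0, integer=True),
    PruningAction("No action", None, None),
)


def make_action(action_no: int,
                raw_param: float) -> Tuple[int, Union[int, float, None]]:
    """Clamp raw_param into the range listed for action_no."""
    spec = ACTIONS[action_no]
    if spec.low is None or spec.high is None:
        return action_no, None          # "No action" ignores the parameter
    value = min(max(raw_param, spec.low), spec.high)
    if spec.integer:
        return action_no, round(value)  # integer-valued parameter
    return action_no, value


print(make_action(0, 0.2))   # thinning distance clamped to 0.05 m
print(make_action(7, 3.7))   # topping count rounded to an integer
print(make_action(8, 1.0))   # no parameter for "No action"
```

The binary version of the game would need a second table with its own ranges, since the same action numbers carry different parameter meanings there.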
2. Detailed Result Data
[Figure, four panels in a two-by-two layout: (a) “Played Rounds per Episode” (x: Episode, 0 to 60; y: Total Round, 0 to 20); (b) “Time Consumption by Episode” (x: Episode, 0 to 60; y: Total Time Taken in seconds, 0 to 16000); (c) “Frequency of the Chosen Actions by Episode” (x: Episode, 0 to 60; y: Frequency in times; one line per action, Action 0 through Action 8, with a vertical line near episode 8 separating “Random Actions” from “Chosen by the P-DQN Network”); (d) “Used Action Parameters” (x: Parameters for Action 0, 0 to 0.05, and Parameters for Action 1 and 3, 0 to 10; y: Frequency in times).]
Statistics from the log file regarding the game round, consumed time, and actions chosen in the experiments
[Figure, four panels in a two-by-two layout: (a) “Played Rounds per Episode” (x: Episode, 0 to 1000; y: Total Round, 0 to 30); (b) “Time Consumption by Episode” (x: Episode, 0 to 1000; y: Total Time Taken in seconds, 0 to 10); (c) “Frequency of the Chosen Actions by Episode” (x: Episode, 0 to 1000; y: Frequency in times; one line per action, Action 0 through Action 8, with a vertical line near episode 180 separating “Random Actions” from “Chosen by the P-DQN Network”); (d) “Used Action Parameters” (x: Parameters for Action 0, 0 to 0.5, and Parameters for Action 3 and 7, 0 to 10; y: Frequency in times).]
Statistics from the log file regarding the simplified tree growth game using binary voxels. Compacted information is total rounds, consumed time, and actions chosen for each episode
[Figure, two rows of three-dimensional voxel plots, each labeled “Leaf State in LAD”, showing the canopy at rounds 0 to 30 in steps of two, followed by a final panel labeled “Target”. A “Score” is given below each plot, and the “Action” and “Parameter” values are given below the arrows between consecutive plots. The score rises from 0.1 at round 0 to a peak of 15.90 at round 24 and then declines to 11.99 at round 30; Action 0 is chosen in every step shown.]
Rendered records of the tree growth in the binary tree growth game at Episode 800, where the pruning strategy was stabilized. The starting state at round 0 and the target state were always the same in the training
