Abstract

Large-scale, parallel clusters composed of commodity processors are increasingly available, enabling the use of vast processing capabilities and distributed RAM to solve hard search problems. We investigate parallel algorithms for optimal sequential planning, with an emphasis on exploiting distributed memory computing clusters. In particular, we focus on an approach which distributes and schedules work among processors based on a hash function of the search state. We use this approach to parallelize the A* algorithm in the optimal sequential version of the Fast Downward planner. The scaling behavior of the algorithm is evaluated experimentally on clusters using up to 128 processors, a significant increase compared to previous work in parallelizing planners. We show that this approach scales well, allowing us to effectively utilize the large amount of distributed memory to optimally solve problems which require hundreds of gigabytes of RAM to solve. We also show that this approach scales well for a single, shared-memory multicore machine.

Introduction

In classical planning, many problem instances remain hard for state-of-the-art planning systems. Both the memory and the CPU requirements are main causes of performance bottlenecks. The problem is especially pressing in sequential optimal planning. Despite significant progress in recent years in developing domain-independent admissible heuristics (Haslum and Geffner 2000; Edelkamp 2001; Helmert, Haslum, and Hoffmann 2007), scaling up optimal planning remains a challenge. Recent results suggest that improving heuristics may provide diminishing marginal returns (Helmert and Roger 2008), suggesting that research in orthogonal methods for speeding up search is necessary.

Multi-processor, parallel planning\(^1\) has the potential to provide both the memory and the CPU resources required to solve challenging problem instances. Parallel planning has received little attention in the past, two notable exceptions being the work of Zhou and Hansen (2007) and Burns et al. (2009). While multiprocessors were previously expensive and rare, multicore machines are now ubiquitous. Future generations of hardware are likely to continue to have an increasing number of processors, where the speed of each individual CPU core does not increase as rapidly as in past decades. Thus, exploiting parallelism will be the only way to extract significant speedups from the hardware. Since parallelism is a ubiquitous trend, there is a need to develop techniques to enable the domain-independent planning technology to scale further.

Previous work in parallel planning (Zhou and Hansen 2007; Burns et al. 2009) has taken a multi-threaded approach on a single, multicore machine. The number of processors is limited to relatively small values (typically up to 8). Thread-based approaches are specific to shared-memory environments (Lin and Snyder 2009), which have less memory and CPU cores than a distributed-memory environment. Zhou and Hansen (2007) address the memory bottleneck by resorting to the external memory, which introduces an additional time overhead caused by the expensive I/O operations.

Our goal is to push the scalability further by using the large memory and CPU resources available in distributed memory clusters.

We introduce Hash Distributed A* (HDA*), an algorithm that extends A* (Hart, Nilsson, and Raphael 1968) to a parallel environment. HDA* combines successful ideas from previous parallel algorithms, such as PRA* (Evett et al. 1995), which is based on A*, and TDS (Romein et al. 1999), a parallel version of IDA* (Korf 1985). As introduced in PRA*, a hash function assigns each state to a unique processor. A newly generated state is sent to its destination processor that generated it. The main advantage is that state duplicate detection can be performed locally, with no communication overhead. Despite this, PRA* incurs a considerable synchronization overhead caused by using synchronous communication. As in TDS, HDA* performs asynchronous, non-blocking communication. Unlike TDS, HDA* is a parallelization of A*. Besides performance, another key feature of HDA* is simplicity. Simplicity is especially important in parallel algorithms, as debugging a program on a multi-machine environment is very challenging.

We implement HDA* on top of the optimal version of the Fast Downward planner, which is described by Helmert,
Haslum, and Hoffmann (2007). Rather than using threads, our implementation is a distributed, message passing implementation using MPI (Snir and Gropp 1998), which allows parallelization in distributed memory environments as well as shared memory and mixed environments (cluster of multi-core machines), and support mechanisms for both synchronous and asynchronous communication.

The scaling behavior of the algorithm is evaluated experimentally on clusters using up to 128 processors, a significant increase compared to previous work in parallelizing planners. We show that this approach scales well, allowing us to effectively utilize the large amount of distributed memory to optimally solve problems which require hundreds of gigabytes of RAM. We also show that HDA* works well on a single, multi-core machine, outperforming algorithms such as PRA* and a parallel implementation of A* based on work stealing, a standard strategy in parallel search.

The rest of the paper is organized as follows. The next section presents background information, followed by related work. Our planning algorithm is described in the fourth section. We then present an empirical analysis, followed by concluding remarks and future work ideas.

Background

Efficient implementation of parallel search algorithms is challenging largely due to several types of overhead. Search overhead occurs when a parallel implementation of a search algorithm generates more states than a serial implementation. The main cause of search overhead is partitioning of the search space among processors, which has the side effect that the access to non-local information is restricted. For example, a sequential A* algorithm terminates immediately after a solution is found, as it is guaranteed to be optimal. In contrast, when a parallel A* algorithm finds a (first) solution, that is not necessarily an optimal solution. Better solutions might exist in non-local portions of the search space.

The search overhead can be negative, which means that parallel search expands fewer states than sequential search. A negative search overhead possibly results in achieving super-linear speedup and often indicates an inefficiency in the serial implementation or algorithm.

The synchronization overhead is the idle time wasted at synchronization points, where some processors have to wait for the others to reach the synchronization point. For example, in a shared-memory environment, the idle time can be caused by mutual exclusion (mutex) locks on shared data that cannot be accessed by more than one processor at a time. The communication overhead refers to the extra cost of inter-process information exchange, and mainly occurs in a distributed-memory environment.

The key to achieving a good speedup in parallel search is to minimize such overheads. This is often a difficult task, in part because the overheads depend on one another. For example, reducing the search overhead usually increases the synchronization and the communication overheads. It is hard to theoretically characterize the best trade-off in minimizing the overheads. In practice, the trade-offs are evaluated and tuned experimentally.

Work stealing is a standard approach for parallel search, and is used in many applications in shared-memory environments. It aims at achieving good load balancing (i.e., keep all processors busy at all times). In work-stealing, each processor maintains a local work queue. When a processor P generates new work (i.e., new states to be expanded) w, it places w in P’s own local queue. When P has no work in its queue, it steals work from the queue of a busy processor. Several strategies have been studied to select a processor offloading the work (e.g. (Feldmann 1993; Frigo, Leiserson, and Randall 1998; Rao and Kumar 1987)). A typical implementation of the local work queue for parallel A* is to simply use the local open list of a processor.

In many search applications, including planning benchmarks, the search space is a graph rather than a tree. More than one path can lead to the same state. Sequential best-first search can detect and handle this by using a closed list (i.e., hash table) or duplicate detection techniques (e.g. (Korf and Zhang 2000; Zhou and Hansen 2006)). Efficient duplicate detection is critical for performance, both in serial and parallel search algorithms, and can potentially eliminate vast amounts of redundant work.

In parallel search, performing duplicate state detection in parallel incurs several overheads. The cause of an overhead depends on the choices of algorithms and machine environments. In a shared-memory environment, many approaches, including work-stealing, need mutex operations for the open and closed lists, to guarantee that these structures are correctly managed. For example, an early work on parallel A* which shares one open list among all processors (Kumar, Ramesh, and Rao 1988) has a severe bottleneck caused by the contention for the open list.

A discussion on how such challenges are addressed in actual algorithms can be found in the next section.

Related Work

Parallel Retracting A* (PRA*) (Evett et al. 1995) simultaneously addresses the problem of work distribution and duplicate state detection. In PRA*, each processor maintains its own open and closed lists. A hash function maps each state to exactly one processor. When generating a state, PRA* distributes it to the corresponding processor. If the hash keys are distributed uniformly across the processors, load balancing is achieved. After receiving states, PRA* has the advantage that duplicate detection can be performed efficiently. All the checks are done locally at the destination process.

On the other hand, PRA* incurs a significant synchronization overhead, as it uses synchronous communication to distribute states. When a processor P generates a new state s and sends it to the destination processor Q, P blocks and waits for Q to confirm that s has successfully been received and stored. This is needed because PRA* was implemented on a Connection Machine, where each processor had a limited amount of local memory. When a processor’s memory became full, a retraction mechanism was used to remove nodes in order to free memory.

Transposition-table driven work scheduling (TDS) (Romein et al. 1999) is a distributed memory, parallel IDA* algorithm. Similarly to PRA*, TDS distributes work
using a state hash function. As opposed to PRA∗, it has no synchronization overhead. Extensive performance analysis on TDS was later performed (Romein et al. 2002). The transposition table is partitioned over processors to be used for detecting and pruning duplicate states that arrive at the processor. In this way, TDS exploits the large amounts of local memory that are available on modern machines. As the number of processing nodes increases, the amount of RAM in the system increases, allowing more effective duplicate state detection and pruning. This allows TDS to exhibit a very low (even negative) search overhead, compared to a sequential IDA∗ that runs on a single computational node, with a limited amount of memory.

A very important contribution of TDS is to make all communication asynchronous. After processor \( P \) sends a state to its destination \( q \), \( P \) expands the next state from its local open queue without waiting for \( q \) to reply. Instead, each processor periodically checks if a new state arrives. A possible concern with TDS is large communication overhead, but Romein et al. showed that this was not a significant concern because several states that a processor sends to the same destination can be packed into one message to reduce the communication overhead. TDS achieved impressive speedups in applications such as the 15-puzzle, the double-blank puzzle, and the Rubik’s cube, on a distributed-memory machine. The ideas behind TDS have also been successfully integrated in adversarial two-player search (Kishimoto and Schaeffer 2002; Romein and Bal 2003).

External memory search, a related but different technique, has been used to address memory bottlenecks (e.g., (Edelkamp and Jabbar 2006)). An issue with using external memory (such as disk) is the overhead of expensive I/O operations. In contrast, parallel search can potentially handle both memory and time bottlenecks. Interestingly, some solutions to reducing the I/O overhead in external memory search could in principle be adapted to handle the interprocess communication overhead in parallel search. For example, Zhou and Hansen (2007) and, more recently, Burns et al. (2009) adapt the idea of structured duplicate detection, which was originally introduced for external memory search (Zhou and Hansen 2004), to parallel search.

Zhou and Hansen introduce a parallel, breadth-first search algorithm. Parallel structured duplicate detection seeks to reduce synchronization overhead. The original state space is partitioned into collections of states called blocks. The duplicate detection scope of a state contains the blocks that correspond to the successors of that state. States whose duplicate detection scopes are disjoint can be expanded with no need for synchronization. Burns et al. (2009) have investigated best-first search algorithms that include enhancements such as structured duplicate detection and speculative search. These techniques were effective in a shared memory machine with up to 8 cores.

**Hash Distributed A∗**

We now describe HDA∗, a parallelization of A∗. HDA∗ is a simple algorithm which combines the hash-based work distribution strategy of PRA∗ and the asynchronous communications of TDS. Unlike PRA∗, HDA∗ does not incorporate any mechanism for node retraction. This combination results in a simple algorithm which achieves scalability for both speed and memory usage.

In HDA∗ the closed and open lists are implemented as a distributed data structure, where each processor “owns” a partition of the entire search space. The partitioning is done via a hash function on the state, and is described later.

The overall HDA∗ algorithm begins by the expansion of the start state at the head processor.

Each processor \( P \) executes the following loop until an optimal solution is found:

1. First, \( P \) checks if a new state has been received in its message queue. If so, \( P \) checks for this new state \( s \) in \( P \)'s closed list, in order to determine whether \( s \) is a duplicate, or whether it should be inserted in \( P \)'s local open list.

2. If the message queue is empty, then \( P \) selects a highest priority state from its local open list and expands it, resulting in newly generated states. For each of the newly generated states \( s_i \), a hash key \( K(s_i) \) is computed based on the state representation, and the generated state is then sent to the processor which owns \( K(s_i) \). This send is asynchronous and non-blocking. \( P \) continues its computation without waiting for a reply from the destination.

In a typical, straightforward implementation of a hash-based work distribution scheme on a shared memory machine, each processing thread owns a local open/closed list which is implemented in shared memory, and when a state is assigned to some thread, the writer thread obtains a lock on the target shared memory, writes the state, then releases the lock. Note that whenever a processor \( P \) “sends” a state \( s \) to a destination \( dest(s) \), then \( P \) must wait until the lock for shared open list (or message queue) for \( dest(s) \) is available and not locked by any other processor. This results in significant synchronization overhead – for example, it was observed in (Burns et al. 2009) that a straightforward implementation of PRA∗ exhibited extremely poor performance on the Grid search problem, where multi-core performance for up to 8 cores was consistently slower than sequential A∗. While it is possible to speed up locking operations by using, for example, highly optimized lock operations implementations in inline assembly language such as those which are commonly used in the two-player game community, the performance degradation due to the increase in synchronization points caused by locks remains a considerable problem (see discussion in the next section).

In contrast, the open/closed lists in HDA∗ are not explicitly shared among the processors. Thus, even in a multi-core environment where it is possible to share memory, all communications are done between separate MPI processes using non-blocking send/receive operations. Our implementation

\[^2\]Even if the heuristic function (Hellert, Haslum, and Hoffmann 2007) is consistent, parallel A∗ search may sometimes have to re-open the state saved in the closed list. For example, \( P \) may receive many identical states \( s \) with various priorities from different processors and these \( s \) may reach \( P \) in any order.
achieves it by using MPI_Bsend and MPI_Iprobe. This enables HDA* to utilize highly optimized message buffers implemented in MPI. Additionally, more than one state can be packed to reduce the number of MPI communications.

In parallel A*, even if a process discovers a goal state, it is not guaranteed to be optimal (Kumar, Ramesh, and Rao 1988). When a processor discovers a goal state \( G \), the processor broadcasts the cost of \( G \) to all processors. The search cannot terminate until all processors have proved that there is no solution with a cost better than that of \( G \). In order to correctly terminate parallel A*, it is not sufficient to check the local open list at every processor. We must also prove that there is no work (states) currently on the way to arrive at a processor. Various algorithms to handle termination exist. In our implementation of HDA*, we used the time algorithm of Mattern (1987), which was also used in TDS.

In a hash based work distribution scheme, the choice of the hash function is essential for achieving uniform distribution of the keys, which results in effective load balancing. Our implementation of HDA* uses the Zobrist function to map a SAS+ state representation (Bäckström and Nebel 1995) to a hash key. The Zobrist function (Zobrist 1970) is commonly used in the game tree search community. It is a very fast hash function based on incrementally XOR’ing the components of a state. The Zobrist function was previously used in a sequential planner by Botea et al. (2005).

### Results

We experimentally evaluated HDA* by running Fast Downward + HDA* on classical planning instances from the IPC-3, IPC-4, and IPC-6 planning competitions. Our experimental code is based on a sequential optimal version of Fast Downward, enhanced with an explicit state abstraction heuristic (Helmer, Haslum, and Hoffmann 2007). HDA* is implemented in C++ and compiled with gcc, parallelized using the MPI message passing library. While HDA* and other parallel search algorithms have nondeterministic behavior and there will be some differences between identical invocations of the algorithms, on the runs where we collected multiple data points, we did not observe significant differences between runs of HDA*. Therefore, due to the enormous resource requirements of a large-scale experimental study, the results shown are for single runs.

### Experiments on a Single, Multi-Core Machine

First, we investigate the scaling of HDA* on a single machine. We compare HDA* with sequential A* and shared-memory implementations of PRA* and WSA* (work-stealing A*). All of our algorithms use TCAlloc (http://code.google.com/p/google-perftools/), a fast and thread-safe memory management library. In addition to locks available in the Boost C++ library, we also incorporated spin locks used in GPShogi\(^6\) in order to speed up WSA* and PRA*. The spin lock implementation is based on the “xchgl” assembly operation.

In WSA*, there is a trade-off between searching the most promising states in parallel and working on the states in a local open list for avoiding synchronization overhead. The best work strategy is selected, after comparing several work-stealing implementations such as techniques in (Feldmann 1993; Kumar, Ramesh, and Rao 1988). Our WSA* manages the current best score of the states in the open lists. If a thread expands a less promising state in its local open list, it tries to steal work from a thread having the most promising states in the next state selection phase.

These experiments were run on a dual quad-core 2.66GHz Xeon E5430 with 6MB L2 cache (total of 8 cores) and 16 gigabytes of RAM. Each algorithm used the full 16 gigabytes of RAM. That is, \( n \)-core HDA* spawns \( n \) processes, each using \( 16/n \) gigabytes of RAM, sequential A* used the full 16GB available, and the multithreaded PRA* and WSA* algorithms shared 16GB of RAM among all of the threads.

Table 1 shows the speedup of HDA*, PRA*, and WSA* for 4 cores and 8 cores. In addition to runtimes for all algorithms, the speedup and the parallel efficiency are shown for HDA*. The efficiency of a parallel computation is defined as \( S/P \), where \( S \) is the speedup and \( P \) is the number of cores. As shown in Table 1, HDA* performs significantly better than WSA* and PRA*. With 4 cores, the speedup of HDA* ranges from 2.62 to 3.67, and the efficiency ranges from 0.65 to 0.92. With 8 cores, the speedup of HDA* ranges from 3.62 to 6.40, and the efficiency ranges from 0.45 to 0.80.

Although it is not possible to directly compare these results with the results in (Burns et al. 2009) due to many factors (different underlying search code which uses different heuristics, different # of cores, different problem instances\(^7\), etc.), it is possible to make some observations about comparative speedups. Compared to sequential A*, the algorithms proposed in (Burns et al. 2009) achieve excellent, even super-linear speedups, and this is because of techniques such as speculative expansion – their algorithms do not directly parallelize A*; rather, they implement a different, node expansion strategy in order to overcome parallelization overheads, and this resulted in a strategy which outperformed A*. On the other hand, if we look at the scaling behavior of each algorithm implementation as the number of cores is increased (i.e., comparison of the exact same code running on

\(^5\)We are using a shared cluster for our experiments, and large-scale experiments have issues of resource contention, because we are competing for these resources with hundreds of other users. In addition, some of the resources such as the 128GB RAM machine incur a usage cost per CPU hour.

\(^6\)http://gps.tanaka.ecc.u-tokyo.ac.jp/gpsshogi/pukiwiki.php?GPShogi

\(^7\)Different instances are used because key aspects of the algorithms, such as the heuristic, are different. Instances which are time-consuming for the algorithms in (Burns et al. 2009) are not necessarily difficult for HDA*, and vice versa – for example, the depots-7 and freecell-3 instances are solved in 9.9 seconds and 4.2 seconds by sequential Fast Downward (abstraction size parameter 1000). Instances solved quickly by sequential runs do not yield much useful information, because they are dominated by startup and termination overheads when parallelized. Thus, we chose our instances based on preliminary experiments which indicated the difficult problems for sequential Fast Downward+LFPA.
Table 1: Comparison of sequential A*, WSA* (work-stealing), PRA*, and HDA* on 1, 4, and 8 cores on a 2.66GHZ, 16GB 8-core Xeon E5430. Runtimes (in seconds), speedup, efficiency, and abstraction heuristic initialization times (not included in runtimes) are shown.

<table>
<thead>
<tr>
<th># of cores</th>
<th>A* time</th>
<th>WSA* time</th>
<th>PRA* time</th>
<th>HDA* time</th>
<th>speedup</th>
<th>eff.</th>
<th>time</th>
<th>WSA* time</th>
<th>PRA* time</th>
<th>HDA* time</th>
<th>speedup</th>
<th>eff.</th>
<th>time</th>
<th>Initialization</th>
</tr>
</thead>
<tbody>
<tr>
<td>Depots4</td>
<td>74.87</td>
<td>47.98</td>
<td>48.31</td>
<td>24.38</td>
<td>3.07</td>
<td>0.77</td>
<td>37.92</td>
<td>33.67</td>
<td>15.22</td>
<td>4.92</td>
<td>0.61</td>
<td>4.21</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Depots10</td>
<td>173.16</td>
<td>122.72</td>
<td>128.59</td>
<td>66.16</td>
<td>2.62</td>
<td>0.65</td>
<td>97.01</td>
<td>92.36</td>
<td>47.8</td>
<td>3.62</td>
<td>0.45</td>
<td>2.03</td>
<td></td>
<td></td>
</tr>
<tr>
<td>DriverLog8</td>
<td>95.82</td>
<td>80.19</td>
<td>74.29</td>
<td>33.54</td>
<td>2.86</td>
<td>0.71</td>
<td>68.44</td>
<td>54.85</td>
<td>24.01</td>
<td>3.99</td>
<td>0.5</td>
<td>0.12</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Freecell5</td>
<td>113.09</td>
<td>52.66</td>
<td>52.66</td>
<td>33.65</td>
<td>3.36</td>
<td>0.84</td>
<td>38.06</td>
<td>35.38</td>
<td>20.22</td>
<td>5.59</td>
<td>0.7</td>
<td>5.26</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Rover12</td>
<td>375.03</td>
<td>295.38</td>
<td>269.28</td>
<td>122.84</td>
<td>3.05</td>
<td>0.76</td>
<td>234.46</td>
<td>196.27</td>
<td>80.66</td>
<td>4.65</td>
<td>0.58</td>
<td>0.11</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Zenotravel9</td>
<td>135.88</td>
<td>119.06</td>
<td>106.65</td>
<td>47.4</td>
<td>2.87</td>
<td>0.72</td>
<td>103.98</td>
<td>79.76</td>
<td>35.69</td>
<td>3.81</td>
<td>0.48</td>
<td>0.16</td>
<td></td>
<td></td>
</tr>
<tr>
<td>PipesNoTk14</td>
<td>165.37</td>
<td>100.82</td>
<td>94.28</td>
<td>51.59</td>
<td>3.21</td>
<td>0.83</td>
<td>81.01</td>
<td>65.36</td>
<td>32.89</td>
<td>5.03</td>
<td>0.63</td>
<td>1.16</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Pipes/Tank15</td>
<td>60.25</td>
<td>31.53</td>
<td>32.13</td>
<td>18.18</td>
<td>3.31</td>
<td>0.83</td>
<td>24.26</td>
<td>22.03</td>
<td>11.36</td>
<td>5.31</td>
<td>0.66</td>
<td>4.92</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Pegsol26</td>
<td>44.88</td>
<td>27.45</td>
<td>26.22</td>
<td>12.83</td>
<td>3.5</td>
<td>0.87</td>
<td>22.6</td>
<td>18.57</td>
<td>8.83</td>
<td>5.08</td>
<td>0.64</td>
<td>6.64</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Pegsol27</td>
<td>152.79</td>
<td>92.13</td>
<td>81.39</td>
<td>44.01</td>
<td>3.47</td>
<td>0.87</td>
<td>74.26</td>
<td>57.96</td>
<td>26.57</td>
<td>5.75</td>
<td>0.72</td>
<td>0.82</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Pegsol28</td>
<td>661.14</td>
<td>392.82</td>
<td>363.61</td>
<td>190.57</td>
<td>3.47</td>
<td>0.87</td>
<td>318.79</td>
<td>249.95</td>
<td>115.43</td>
<td>5.73</td>
<td>0.72</td>
<td>0.49</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Sokoban4</td>
<td>38.45</td>
<td>22.66</td>
<td>21.65</td>
<td>10.71</td>
<td>3.59</td>
<td>0.9</td>
<td>18.42</td>
<td>15.2</td>
<td>6.27</td>
<td>6.13</td>
<td>0.77</td>
<td>2.5</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Sokoban9</td>
<td>43.69</td>
<td>26.24</td>
<td>24.29</td>
<td>12.34</td>
<td>3.54</td>
<td>0.89</td>
<td>21.62</td>
<td>17.02</td>
<td>6.82</td>
<td>6.4</td>
<td>0.8</td>
<td>1.4</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Sokoban13</td>
<td>41.59</td>
<td>27.22</td>
<td>25.84</td>
<td>12.85</td>
<td>3.24</td>
<td>0.81</td>
<td>22.06</td>
<td>18.05</td>
<td>7.44</td>
<td>5.59</td>
<td>0.7</td>
<td>0.77</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Sokoban30</td>
<td>290.39</td>
<td>143.19</td>
<td>145.53</td>
<td>79.05</td>
<td>3.67</td>
<td>0.92</td>
<td>105.97</td>
<td>98.46</td>
<td>45.92</td>
<td>6.32</td>
<td>0.79</td>
<td>1.05</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Figure 1: CPU Utilization of HDA* (top), PRA* (middle), and WSA* (bottom) on 8 core multi-core (Depots-10).

Table 1 shows the runtimes and speedup, and efficiency of HDA* on 16, 64, and 128 cores, relative to sequential A*.

Experiments on a Cluster

Next, we investigate the scaling behavior of HDA* for clusters of machines. These parallel experiments were performed on a Sun Fire X4600 cluster, where each node has 8 AMD dual core Opteron processors (total 16 cores per node) and 32 GB RAM per node, with a clock speed of 2.4GHz.

We used 1-8 nodes in our experiments (i.e., 16-128 cores).

Table 2 shows the runtimes and speedup, and efficiency of HDA* on 16, 64, and 128 cores, relative to sequential A*.

Due to the runtime scaling, as well as the architectural differences between Opteron and Xeon processors, the 1-core Opteron results in Tables 2-3 are not directly comparable with the 1-core Xeon results in Table 1.

8Due to the runtime scaling, as well as the architectural differences between Opteron and Xeon processors, the 1-core Opteron results in Tables 2-3 are not directly comparable with the 1-core Xeon results in Table 1.

9We are currently completing a straightforward parallelization
Table 2: Execution time (for search, excluding abstraction initialization), speedup, and efficiency on a large-scale cluster with up to 128 2.4GHz Opteron cores, 2GB RAM per core (Abstraction size = 1000). The 1-core results use a Opteron-based machine with 128GB RAM. “n/a” = failure due to exhausted memory.

<table>
<thead>
<tr>
<th>Problem</th>
<th>1 core time</th>
<th>16 cores time</th>
<th>speedup</th>
<th>eff</th>
<th>64 cores time</th>
<th>speedup</th>
<th>eff</th>
<th>128 cores time</th>
<th>speedup</th>
<th>eff</th>
</tr>
</thead>
<tbody>
<tr>
<td>Depot13</td>
<td>325.74</td>
<td>38.23</td>
<td>8.52</td>
<td>0.53</td>
<td>11.86</td>
<td>27.46</td>
<td>0.43</td>
<td>10.28</td>
<td>31.70</td>
<td>0.25</td>
</tr>
<tr>
<td>Rover12</td>
<td>521.13</td>
<td>58.09</td>
<td>8.97</td>
<td>0.56</td>
<td>16.09</td>
<td>32.38</td>
<td>0.31</td>
<td>10.01</td>
<td>52.04</td>
<td>0.41</td>
</tr>
<tr>
<td>Zeno Trav11</td>
<td>2688.93</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>82.41</td>
<td>32.63</td>
<td>0.51</td>
<td>44.90</td>
<td>59.88</td>
<td>0.47</td>
</tr>
<tr>
<td>PipesNoTk24</td>
<td>1269.16</td>
<td>165.59</td>
<td>7.66</td>
<td>0.48</td>
<td>42.27</td>
<td>30.03</td>
<td>0.47</td>
<td>34.20</td>
<td>37.11</td>
<td>0.29</td>
</tr>
<tr>
<td>Pegsol28</td>
<td>886.01</td>
<td>75.69</td>
<td>11.71</td>
<td>0.73</td>
<td>21.75</td>
<td>40.73</td>
<td>0.64</td>
<td>17.54</td>
<td>50.51</td>
<td>0.29</td>
</tr>
<tr>
<td>Pegsol29</td>
<td>4509.22</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>109.65</td>
<td>41.13</td>
<td>0.64</td>
<td>75.35</td>
<td>59.84</td>
<td>0.47</td>
</tr>
<tr>
<td>Sokoban12</td>
<td>466.90</td>
<td>37.47</td>
<td>12.46</td>
<td>0.78</td>
<td>14.24</td>
<td>32.80</td>
<td>0.51</td>
<td>12.37</td>
<td>37.75</td>
<td>0.29</td>
</tr>
<tr>
<td>Sokoban14</td>
<td>2201.76</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>55.75</td>
<td>39.50</td>
<td>0.62</td>
<td>34.48</td>
<td>63.86</td>
<td>0.50</td>
</tr>
<tr>
<td>Sokoban15</td>
<td>2639.30</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>76.55</td>
<td>34.48</td>
<td>0.54</td>
<td>45.15</td>
<td>58.46</td>
<td>0.46</td>
</tr>
<tr>
<td>Sokoban21</td>
<td>1529.45</td>
<td>145.76</td>
<td>10.49</td>
<td>0.66</td>
<td>45.74</td>
<td>33.44</td>
<td>0.52</td>
<td>29.25</td>
<td>52.28</td>
<td>0.41</td>
</tr>
<tr>
<td>Sokoban23</td>
<td>589.89</td>
<td>45.89</td>
<td>12.85</td>
<td>0.80</td>
<td>16.26</td>
<td>36.28</td>
<td>0.57</td>
<td>13.23</td>
<td>44.60</td>
<td>0.35</td>
</tr>
<tr>
<td>Sokoban24</td>
<td>950.55</td>
<td>76.82</td>
<td>12.37</td>
<td>0.77</td>
<td>26.11</td>
<td>36.41</td>
<td>0.57</td>
<td>18.06</td>
<td>52.62</td>
<td>0.41</td>
</tr>
<tr>
<td>Sokoban30</td>
<td>378.30</td>
<td>31.49</td>
<td>12.01</td>
<td>0.75</td>
<td>13.05</td>
<td>29.00</td>
<td>0.45</td>
<td>11.91</td>
<td>31.78</td>
<td>0.25</td>
</tr>
<tr>
<td>Sokoban25</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>129.05</td>
<td>n/a</td>
<td>n/a</td>
</tr>
<tr>
<td>Driverlog13</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>n/a</td>
<td>179.28</td>
<td>n/a</td>
<td>n/a</td>
</tr>
</tbody>
</table>

regardless of the number of cores. The abstraction heuristic initialization times are shown separately in Table 2. For example, the IPC6 Pegsol-28 instance, which requires 4509 seconds with 1 core, was solved in 75 seconds with 128 cores, plus 15.15 seconds for the abstraction table generation. The “n/a” in the Sokoban-14/15 entries for 16 cores indicates that 32GB was insufficient to solve these instances (additional memory would allow 16-cores to solve these instances). The Sokoban-25 and Driverlog-13 instances were only solved using 128 cores, because only the 128-core run had sufficient memory (256GB).

Overall, HDA* achieved a search speedup of 8-13 with 16 cores, 27-41 with 64 cores, and 31-64 with 128 cores, demonstrating reasonably good scalability for a large number of processors. The parallel efficiency of HDA* ranges between 0.48-0.77 for 16 cores, 0.43-0.64 for 64 cores, and 0.25-0.50 for 128 cores.

The search overhead, which indicates the extra states explored by parallel search, is defined as:

$$SO = 100 \times \left(\frac{\text{number of states searched by parallel search}}{\text{number of states searched by sequential search}} - 1\right)$$

Figure 2 shows the search overhead, plotted against the length of the optimal plan for the 16, 64, and 128 core data in Table 2. Although most of the data points are at the lower left corner (low search overhead), there are some data points with very high search overhead. The figure shows that search overhead in HDA* is clearly correlated with solution length. The data points in the right side, with long solutions, indicate that 32GB was insufficient to solve these instances. The Sokoban-14/15 entries for 16 cores indicates that 32GB was insufficient to solve these instances (additional memory would allow 16-cores to solve these instances). The Sokoban-25 and Driverlog-13 instances were only solved using 128 cores, because only the 128-core run had sufficient memory (256GB).

The size of the heuristic abstraction table is a control parameter for Fast Downward. While almost all of the data presented in this paper is based on a value of 1000 for the abstraction size parameter, preliminary experiments have shown that search speedups were not dependent on the abstraction size parameter. As an example, Table 3 compares search times for 1 and 16 cores using an abstraction size of 5000. The speedups and parallel efficiencies are comparable to the 16-core results in Table 2 for the same instances using abstraction size 1000. Of course, as the abstraction size parameter is increased, the amount of time spent for
Table 3: Search execution time, speedup, and efficiency for abstraction size = 5000. Times are in seconds (same machines as in Table 2).

<table>
<thead>
<tr>
<th></th>
<th>1 core time</th>
<th>16 cores time</th>
<th>speedup</th>
<th>effic</th>
<th>init time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Rover12</td>
<td>404.51</td>
<td>48.04</td>
<td>8.44</td>
<td>0.53</td>
<td>1.12</td>
</tr>
<tr>
<td>ZenoTravel11</td>
<td>12.13</td>
<td>2.08</td>
<td>5.82</td>
<td>0.36</td>
<td>1.29</td>
</tr>
<tr>
<td>PipesNoTk24</td>
<td>162.52</td>
<td>25.33</td>
<td>6.42</td>
<td>0.40</td>
<td>21.81</td>
</tr>
<tr>
<td>Pegsol28</td>
<td>974.63</td>
<td>83.33</td>
<td>11.70</td>
<td>0.73</td>
<td>16.35</td>
</tr>
<tr>
<td>Sokoban24</td>
<td>979.21</td>
<td>102.05</td>
<td>9.6</td>
<td>0.60</td>
<td>6.52</td>
</tr>
</tbody>
</table>

Table 3: Search execution time, speedup, and efficiency for abstraction size = 5000. Times are in seconds (same machines as in Table 2).

Discussion and Conclusion

In order to scale up the capabilities of sequential optimal planning algorithms, this paper investigated the parallelization of the search algorithm. We developed Hash Distributed A*, a parallelization of A* for a distributed memory cluster. HDA* relies on two key ideas: (1) distribution of work using a hash value for generated states, which is from PRA* (Evett et al. 1995), and (2) completely asynchronous operation, which was shown to be effective in TDS, a parallel IDA* algorithm (Romein et al. 1999). We implemented HDA* as a replacement for the sequential A* search engine for the state-of-the-art, optimal sequential planner, Fast Downward+LFPA (Explicit State Abstraction Heuristic) (Helmer, Haslum, and Hoffmann 2007).

Our experimental evaluation shows that HDA* scales well, achieving 30–60x speedup on 128 processing cores. HDA* exploits the large amount of distributed memory available on a modern cluster, enabling larger problems to be solved than previously possible on a single machine. We are in the process of scaling the experiments to larger number of cores (up to 1024).

One particularly attractive feature of HDA* is its simplicity. Work distribution is done by a simple hash function, and there is no complex load balancing mechanism. All communications are asynchronous, so complex synchronization protocols are not necessary. Despite its simplicity, HDA* achieves significant speedup over the state-of-the-art, Fast Downward+LFPA planner. Simplicity for parallel algorithms is very important, particularly for an algorithm that runs on multiple machines, as debugging a multi-machine, multi-core algorithm is extremely challenging. For comparison, we have also started to implement a distributed memory, work-stealing algorithm, and have found that it is significantly more difficult to implement correctly and efficiently compared to HDA*.

While we developed HDA* for distributed memory parallel search on a distributed memory cluster of machines, we have also shown that HDA* achieves reasonable speedup on a single, shared memory machine with up to 8 cores, with results that are superior to two, previous approaches: thread-based work-stealing (WSA*) and PRA*. HDA* yields speedups of 3.6–6.3 on 8 cores. We have also shown that, on an 8-core machine, HDA* keeps all processors almost completely busy, while PRA* and WSA*, allow processors to be idle due to synchronization overhead.

Thus, the main contributions of this paper are: (1) the proposal of HDA*, a simple, parallel best-first algorithm combining the hash-based work distribution strategy of PRA* and the asynchronous communication strategy of TDS; (2) an experimental demonstration that HDA* can significantly speed up the state-of-the-art Fast-Downward+LFPA planner; (3) an experimental demonstration that HDA* scales up reasonably well to 128 cores; and (4) a demonstration that HDA* performs reasonably well on a single machine, outperforming standard thread-based techniques (WSA* and PRA*). This work has shown that HDA* is a promising approach to parallelizing best-first search for sequential optimal planning.

Currently, HDA* uses a single process per core. Although the machine it runs on can be a shared memory machine (most modern machines are multicore, shared memory machines), HDA* executes as a set of independent processes without sharing any memory resources among cores that are on the same machine. This means that the memory used for the LFPA abstraction heuristic (Helmer, Haslum, and Hoffmann 2007) is unnecessarily replicated n times on an n-core machine, which can be a significant source of inefficiency in memory usage. We are currently investigating a hybrid, distributed/shared memory implementation of HDA* which eliminates this inefficiency. One possible direction for such a hybrid implementation is to distribute work among machines using hash-based distribution, but within a single machine incorporating techniques such as speculative expansion that have been shown to scale well on a shared memory environment with up to 8 cores (Burns et al. 2009).

The speedups we obtain are more modest than the results obtained by Romein et al. (1999) for TDS in puzzle domains, who report linear speedups compared to sequential IDA*. One reason why such impressive speedups are possible for parallel IDA* might be that increasing the aggregate RAM results in a larger, distributed transposition table for
IDA*, which leads to more pruning, and therefore actually improves the search efficiency relative to sequential IDA* on a single machine with less aggregate RAM. In search spaces with duplicate states, the search overhead incurred by sequential IDA* by exploring duplicate states is enormous, and therefore, a massive, distributed transposition table results in search efficiency improvements which make up for any overheads incurred by parallelization. In contrast, for parallel A* algorithms, increasing the amount of aggregate RAM affects whether problems can be solved or not (i.e., whether memory is exhausted before search completes), but by itself, increased memory does not improve the number of nodes explored by the search algorithm, since sequential A* does not reopen duplicate states. On the other hand, it is also possible to use massive amounts of aggregate RAM in different ways to improve performance (e.g., increasing the size of an abstraction-based heuristic table). This remains an area for future work.

Another area of future work is an in-depth investigation of the scalability as the number of nodes increases. Our parallel experiments have used clusters of multicore nodes (16 cores per node), so even with 128 cores, this involved 8 nodes. Since inter-node communications overhead is more significant than intra-node communications within a single node, further investigation is needed to understand the impact of inter-node communications on the scalability of HDA*.

Finally, we note that the serial computation of the abstraction heuristic table (Helmer, Haslum, and Hoffmann 2007) results in a serial bottleneck, as illustrated in our results. We are currently parallelizing an abstraction algorithm.

Acknowledgements

This research is supported by the JSPS Compview GCOE, the Japan MEXT program, “Promotion of Env. Improvement for Independence of Young Researchers”, and the JST PRESTO. NICTA is funded by the Australian government’s Backing Australia’s Ability initiative.

References


