As presented in Section 4, the IsoDen method sorts particles in order of decreasing density, and then examines them in that order to generate a list of halos. At first glance, this procedure is inherently sequential, since changing the order of operations would significantly alter the result. Thus, we are forced to create an alternative formulation that achieves exactly the same result, but that allows for a mostly parallel implementation.
The initial step is the same. We calculate densities and linking lengths for individual particles. This calculation uses the same techniques, and is implemented using the same libraries, as the the density calculation required by the Smooth Particle Hydrodynamics method . By using the parallel tree libraries, it parallelizes immediately. The library handles all data structure manipulation and explicit communication on message passing parallel architectures. The library was originally designed to implement parallel ``Fast'' summation algorithms, e.g. , for the computation of gravitational interactions in time, but the data structures are far more general. The library has been ported to a wide variety of systems including message-passing and shared-memory machines, as well as networks of workstations and uniprocessor systems. In particular, the results presented here for FOF and IsoDen methods were obtained with library implementations which managed all parallelism and data distribution on single and multi-processor SPARC workstations, a 32-node CM-5 at the Australian National University and a 512-node Intel Paragon at Caltech.
A crucial feature of the density-estimation problem is that the particle positions and masses remain fixed throughout the algorithm, and thus may be freely copied between processors without concern for coherence. In fact, the library addresses a slightly more general case which occurs in dynamical simulations -- particle positions may updated only after an implicit or explicit barrier synchronization. That is, forces (or densities) are computed under the assumption that all data is static. A barrier is reached, after which the positions may be altered by the dynamics, followed by another barrier indicating that all positions have been updated and another force calculation, etc.
The parallel tree library distributes the data so that a particular result, e.g., a density estimate for a particular particle is computed by only one processor. The assignment of particles to processors must satisfy two competing goals: the processing load must be balanced, and the necessary interprocessor communication must be minimized. Load balancing of highly irregular data sets is achieved by sorting particle positions along a Peano-Hilbert curve (see Figure 7a), and then chopping the curve into equal pieces. This has the effect of achieving load balance while maintaining spatial locality (and thereby minimizing communication overhead).
Neighbor-finding (and hence evaluation of equation 2), is done by constructing an adaptive oct-tree with individual particles at the leaves and internal nodes corresponding to cubical regions of space. The tree is traversed once for each particle in time proportional to , so that all neighbors are found in time. In parallel, the tree is constructed with ``remote'' pointers to cells that are stored on other processors. When a particular traversal encounters such a pointer, a communication is initiated requesting the data from the remote processor. The substantial latency of such requests is overcome by a) bundling requests together into larger blocks before actually beginning interprocessor communication and b) working on another branch of the tree while the request is being processed. Furthermore, once requested, the contents of a ``remote'' pointer are stored locally, so that subsequent visits to the same cell do not incur the cost of interprocessor communication. This explains the value of spatial locality in the assignment of particles to processors. The neighbor-sets of nearby particles have a great deal of overlap, so that the communication cost of obtaining the neighbors from remote processors is amortized over all the local particles that are neighbors of the remote ones. All of the rather substantial bookkeeping involved is handled by the parallel tree library.
Upon completion of the density estimation phase, every particle has a list of other particles to which it is linked. Storing these lists for all particles at once would significantly increase memory requirements. Therefore, we traverse the tree multiple times rather than exploiting the ``obvious'' optimization of recording explicit lists of linked particles.
For each particle, we identify the highest density neighbor (HDN), i.e., highest density member of the set of particles to which the given particle is directly linked. Some particles are their own HDN. These are precisely the central particles which define for the tentative halos in the sequential formulation, and we give them a unique halo number.
Now we scan the list of particles until we have assigned a preliminary halo number to each particle. For each particle, we check its HDN. If its HDN has a preliminary halo number assigned, then the particle inherits that preliminary halo number. If the particle's HDN is not yet assigned, we revisit the particle again on the next iteration. Note that the graph defined by the neighbor relationship is highly interconnected (every particle is connected to others), so only a few iterations are required to propagate the preliminary halo numbers to the entire system. Some interprocessor communication is required at the start of each iteration to propagate new information about particles that have been assigned halo numbers on the previous iteration.
Next, we examine the set of particles and links again, and for each pair of preliminary halo numbers, we identify the highest-density particle that is assigned to one of the two halos, and is linked to a higher density particle that is assigned to the other halo. The particles identified in this way are the overlap particles from the sequential formulation. Note that most pairs of halos are not linked at all, so this step only requires space. It is easily accomplished by visiting all of the particle-particle links, and maintaining a sparse data structure that records the highest-density particle so far encountered that links each pair of halos. In parallel, when all processors are done with their own data, these local data structures are combined to produce a global structure.
Now it is possible to construct the tree of halos, where the leaves correspond to central regions of isolated halos and the internal nodes correspond to composite halos that meet at overlap particles. One sorts the overlap particles in order of decreasing density, and defines a new composite halo for each overlap. As composite halos are created, the overlap points merge, so that if A and B are merged into M, and they both overlapped with C, then M also overlaps with C, at the maximum of the two previous overlaps between A-C, and B-C. Noise suppression criteria can be implemented here as well, i.e., preliminary halos may be regarded as tentative until their density exceeds a statistical threshold over the background density, or until they pass the evaporative test. Construction of the tree of halos is, unfortunately, inherently sequential, because the overlaps must be treated in order of decreasing density. It is far faster than a naive implementation of the formulation in Section 4, though, because we have identified the overlap particles in parallel, and it is only the tree construction that is sequential. Furthermore, the tree construction only uses data related to the overlaps. The much larger particle data set can be ignored while the logical structure of the tree of halos is constructed from the maximum density overlaps. In practice, it is acceptably fast, even on highly parallel systems. See Section 6 for details of the scalability.
Once the tree of halos has been constructed, with density levels indicating where halos merge into one another, it is a simple matter to distribute the tree to the processors in a parallel system, and then assign each particle to a final, genuine halo in parallel. Starting with its preliminary halo number (at a leaf of the tree), one simply ascends the tree of halos (toward the root) until the overlap density is less than the particle's own density. The particle is assigned to the composite halo defined by this location in the tree.
Friends of friends can be implemented almost as a special case of IsoDen, using constant kernel and linking lengths, and using the number of friends as a measure of density. In the FOF case all halo overlaps result in the overlapping halos being merged together, and the substructure before the overlap is forgotten.