Next: Data Structures Up: ParallelOut-of-core methods for Previous: Latency Tolerance and Bandwidth

The Page Abstraction

In order to use disk for dynamic storage, we must first devise a mechanism for referring to the data that resides on disk, and for transferring that data to and from memory. One option is simply to use virtual memory. It is a simple matter to allocate program space, e.g., with malloc, that far exceeds the capacity of memory. The OS is then responsible for paging data to and from a ``swap device'', typically a disk partition dedicated to this task. Unfortunately, the programmer has essentially no control over selection of which pages to move, or when to move them. Policies implemented by the OS have usually been designed for large multi-tasking, multi-user systems with load characteristics and requirements far different from our treecode. In addition, the OS generally lacks crucial information that could guide a better strategy. Unless page-hits are recorded by the hardware, or information is solicited from the program itself, the OS does not know which pages are heavily used and which ones are idle, making it impossible to make an intelligent choice about which pages to swap out. The madvise function available on some operating systems is an attempt to address this issue, but it is not universally supported, and even where it is supported, the advice may still be ignored or misunderstood. The standard POSIX interface provides several calls which can be used to transfer data to and from disk. For example, read, and write explicitly request data transfers. Alternatively, mmap and munmap create and destroy correspondences between addressable memory and data on disk. Whichever method we chose, the essential problem remains: devise an abstraction that strikes the right balance between clarity of expression and performance. We want to hide the details of reading and writing from disk, but control the swapping policy so that performance doesn't suffer.

We treat our disk-store as a sequence of fixed-size, sequentially numbered pages. In memory, we maintain a ``working set'' of copies of some of the disk-pages. The size of the working set and the size of the individual pages (a multiple of the underlying OS page size) are determined during runtime initialization. To refer to data in a page, we use PgRef, which returns a pointer to an in-memory copy of the page. The memory at that address will not be replaced by another page from disk until PgUnref is called, and even then, it will be kept as long as possible. PgRefs accumulate, so a page remains ``locked'' in memory as long as the number of PgRefs exceeds the number of PgUnrefs. Much of the data access in our program is read-only, so we wish to avoid unnecessary copies back to disk. PgUnref takes an argument which states whether the caller has dirtied the page by modifying any data. Dirty bits from multiple PgUnref calls are OR-ed together. Before a dirty page is replaced in memory, its contents must be written back to disk.

We use an ad hoc, but effective least-recently-used policy to select pages for replacement. Every time PgRef is called, it increments a global counter which acts as a timestamp, and places the value in a structure associated with the page being referenced. When we need to replace a page, we examine all unreferenced paged, and choose the one with the oldest timestamp. To reduce the number of writes, we preferentially replace clean pages by dividing their age in half before comparison.

This defines an interface to page-sized units of storage. But our program works with bodies, sib_groups, moments and so forth (see Section 4), none of which are the same size, and all of which are far smaller than an entire page. Thus, we introduce a second layer, of ``out-of-core pointers'', oocptrs, which are implemented as a structure containing a page number and an offset. We use OOCRef to return a true pointer that corresponds to an oocptr, and OOCUnRef to signal that we are done with an oocptr. Oocptrs appear in several places in the library where an in-core implementation might use normal pointers. Explicit page numbers and PgRefs, on the other hand, are generally hidden behind the oocptr abstraction.

Next: Data Structures Up: ParallelOut-of-core methods for Previous: Latency Tolerance and Bandwidth

John Salmon
Wed Jan 1 23:00:51 PST 1997