Hardware-assisted garbage collection

Hardware-assisted garbage collection is the use of specialized hardware mechanisms to improve the efficiency and performance of garbage collection in computer systems. This approach integrates hardware support directly into the processor or memory system to handle tasks traditionally managed by software, such as object allocation, reference counting, or mark-and-sweep operations. It is particularly relevant in real-time systems, embedded systems, and high-performance computing environments where software-only garbage collection may introduce unacceptable pauses or overhead.

History

Research into hardware-assisted garbage collection dates back at least to the Lisp machines of the 1980s. Work in the 1990s relied largely on simulation studies to analyze the behavior of proposed designs. By the early 2000s, proposals had emerged for integrating garbage collection into hardware for real-time embedded systems. More recent developments include accelerators for tracing garbage collection and concurrent collectors for functional languages.

Lisp machines of the 1980s incorporated hardware features that supported garbage collection. The Symbolics 3600 used a tagged architecture with per-word tags to distinguish pointers from other data, enabling efficient garbage collection operations in hardware and microcode. In the 1990s, Kelvin Nilsen and others advanced real-time hardware-assisted approaches through simulation and prototypes. A 1994 implementation demonstrated high throughput with bounded allocation and collection times suitable for hard real-time systems.

Other notable historical systems include the Garbage Collected Memory Module, a near-memory co-processor design, and Azul Systems' Vega processors, which included hardware support for read barriers to enable pauseless garbage collection.

While many early proposals did not achieve widespread adoption, renewed interest has emerged due to the slowing of Moore's law, the prevalence of garbage-collected languages, and the rise of cloud computing and hardware accelerators.

Mechanisms

Hardware assistance can involve dedicated instructions for memory allocation, reference tracking, or collection phases. For example, some architectures provide support for bitmap marking or concurrent marking to reduce pauses. In cloud environments, proposed hardware instructions target garbage collection hotspots, with reported speedups of up to an order of magnitude on those operations.
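As an illustration of bitmap marking, the following minimal sketch models a side mark bitmap in software. The `MarkBitmap` class, the 8-byte granule, and the heap addresses are assumptions made for this example and are not taken from any specific architecture. Hardware support for bitmap marking would typically perform the address-to-bit mapping and the test-and-set as a single operation.

```python
class MarkBitmap:
    """Side bitmap with one mark bit per heap granule (hypothetical layout)."""

    GRANULE = 8  # assumed allocation granularity in bytes

    def __init__(self, heap_base, heap_size):
        self.heap_base = heap_base
        self.heap_size = heap_size
        nbits = heap_size // self.GRANULE
        self.bits = bytearray((nbits + 7) // 8)

    def _locate(self, addr):
        """Map a heap address to (byte index, bit mask) in the bitmap."""
        offset = addr - self.heap_base
        if not 0 <= offset < self.heap_size or offset % self.GRANULE:
            raise ValueError(f"address {addr:#x} is not a granule in this heap")
        index = offset // self.GRANULE
        return index >> 3, 1 << (index & 7)

    def test_and_set(self, addr):
        """Mark addr and return True if it was already marked."""
        byte, mask = self._locate(addr)
        was_marked = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return was_marked

    def is_marked(self, addr):
        byte, mask = self._locate(addr)
        return bool(self.bits[byte] & mask)

    def clear(self):
        """Reset all mark bits at the start of a new collection cycle."""
        for i in range(len(self.bits)):
            self.bits[i] = 0
```

Keeping the mark bits in a separate bitmap, rather than in object headers, means the marking phase touches a compact region of memory, which is one reason the operation lends itself to hardware acceleration.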

Generational collection schemes can likewise be combined with hardware support to meet real-time constraints. Other proposals integrate a hardware collector that runs continuously in the background, an approach aimed at embedded systems.

Read and write barriers

Read barriers and write barriers are fundamental mechanisms used by garbage collectors to coordinate between application (mutator) threads and the collector. A write barrier is a fragment of code executed before or after every reference store to maintain collector invariants, such as tracking cross-generational references; a read barrier plays the same role for reference loads. Barriers can be implemented through additional compiler-inserted instructions or by leveraging hardware features such as memory protection, and dedicated hardware support can make them cheaper than a software-only implementation.
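A common software form of the write barrier described above is card marking, sketched below for a toy generational heap. The `CardTable` and `Heap` classes, the 512-byte card size, and the address layout (old generation below `old_limit`) are assumptions for illustration only; a real runtime would emit the barrier inline after each reference store, and hardware support could perform the same check as part of the store itself.

```python
CARD_SHIFT = 9  # assumed card size: 2**9 = 512 bytes


class CardTable:
    """One dirty byte per card of heap memory."""

    def __init__(self, heap_base, heap_size):
        self.heap_base = heap_base
        ncards = (heap_size + (1 << CARD_SHIFT) - 1) >> CARD_SHIFT
        self.cards = bytearray(ncards)

    def mark_dirty(self, addr):
        self.cards[(addr - self.heap_base) >> CARD_SHIFT] = 1

    def dirty_cards(self):
        return [i for i, dirty in enumerate(self.cards) if dirty]


class Heap:
    """Toy heap mapping a field address to the reference stored there."""

    def __init__(self, base, size, old_limit):
        self.old_limit = old_limit  # addresses below this are old generation
        self.fields = {}
        self.card_table = CardTable(base, size)

    def is_old(self, addr):
        return addr < self.old_limit

    def write_ref(self, field_addr, target_addr):
        self.fields[field_addr] = target_addr  # the actual store
        # Write barrier: remember old-to-young references so a young
        # collection can find them without scanning the old generation.
        if (target_addr is not None
                and self.is_old(field_addr)
                and not self.is_old(target_addr)):
            self.card_table.mark_dirty(field_addr)
```

During a young-generation collection, only the cards returned by `dirty_cards()` need to be scanned for references into the young generation.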

Some systems, notably Lisp machines, provided hardware support for forwarding pointers, allowing both old and new addresses to be used interchangeably without explicit checks. Read barriers are needed by fewer collector algorithms than write barriers but may have greater performance impact since pointer reads are more frequent than writes.
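The effect of forwarding pointers can be sketched in software as a read barrier that follows a per-object forwarding field on every access. The `Obj` class and the `resolve`, `load_field`, and `relocate` helpers below are hypothetical names for this example and do not reproduce any particular Lisp machine design; they show the check that hardware forwarding support makes implicit.

```python
class Obj:
    def __init__(self, payload):
        self.payload = payload
        self.fields = {}
        self.forward = None  # set by the collector when the object is moved


def resolve(ref):
    """Read barrier: follow forwarding pointers to the current copy."""
    while ref is not None and ref.forward is not None:
        ref = ref.forward
    return ref


def load_field(obj, name):
    """Load a reference field through the barrier, on both object and result."""
    return resolve(resolve(obj).fields.get(name))


def relocate(obj):
    """Collector side: copy the object and leave a forwarding pointer behind."""
    obj = resolve(obj)
    copy = Obj(obj.payload)
    copy.fields = dict(obj.fields)
    obj.forward = copy
    return copy
```

With this barrier in place, a mutator holding the old address of a moved object still reaches the new copy, at the cost of an extra check on every reference load.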

Applications

Hardware-assisted garbage collection has been explored for virtual machines in cloud computing, where it aims to reduce the overhead of the managed-runtime middleware. It is also relevant for non-strict functional languages with concurrent collectors. Hardware accelerators have been proposed for tracing garbage collection in modern architectures.

Implementations in the wild

Performance

In controlled benchmarks, off-loading the mark phase to a hardware accelerator cut GC time by 65–80% and total application time by up to 25%. Data-centre JVM studies reported consistent sub-1 ms tail latency for 99.999th percentile pauses on 256 GB heaps using Azul's C4 on Vega hardware.

Azul Systems and pauseless collection

Azul Systems developed one of the most commercially successful hardware-assisted garbage collection systems. Beginning in 2005, Azul shipped a pauseless garbage collector on its custom Vega hardware platform. Three successive generations of Vega systems relied on custom multi-core processors and a custom OS kernel to deliver the scale and features needed to support pauseless garbage collection.

The Azul architecture included a dedicated read barrier instruction that enabled a highly concurrent, parallel, and compacting garbage collection algorithm. This barrier allows the collector and mutator threads to progress in parallel, with a "self-healing" property ensuring that object references are corrected once and benefit all application threads. The C4 (Continuously Concurrent Compacting Collector) algorithm evolved from this work, using read barriers to support concurrent compaction, concurrent remapping, and concurrent incremental update tracing. C4 allows JVMs to scale to tens or hundreds of gigabytes in heap sizes while sustaining multi-gigabyte-per-second allocation rates.
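The "self-healing" property can be illustrated with a simplified software model of a barrier applied to each loaded reference. This is a sketch under stated assumptions, not Azul's implementation: the `Collector` class, its forwarding dictionary, and the integer addresses are hypothetical, and the real barrier is a hardware-assisted check rather than a dictionary lookup.

```python
class Collector:
    def __init__(self):
        self.forwarding = {}     # old address -> new address of relocated objects
        self.slow_path_hits = 0  # counts how often the barrier had to repair a slot

    def relocate(self, old_addr, new_addr):
        self.forwarding[old_addr] = new_addr

    def load_ref(self, slots, index):
        """Load a reference from slots[index] through the barrier."""
        ref = slots[index]
        if ref in self.forwarding:      # stale reference: take the slow path
            self.slow_path_hits += 1
            ref = self.forwarding[ref]  # remap to the relocated copy
            slots[index] = ref          # self-healing: repair the source slot
        return ref
```

Because the repaired value is written back to the slot it was loaded from, a given stale reference triggers the slow path at most once, and later loads of the same slot by any thread take the fast path.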

Modern architectural support

While specialized garbage collection hardware was once the domain of niche machines, features in general-purpose processors are increasingly being adapted for this purpose.

ARM Memory Tagging Extension (MTE)

Introduced in the Armv8.5-A architecture and carried into Armv9, MTE provides hardware support for tagging memory regions. While primarily designed for security, to detect buffer overflows and use-after-free errors, it has also been explored as a way to accelerate garbage collection. Each 16-byte memory granule carries a 4-bit tag that is checked against a tag held in the pointer on every access. A collector can use these tags to encode the state of memory blocks, supporting faster mark and sweep phases while ensuring that the mutator does not access stale memory.
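The tag check can be modeled in software as follows. This is only a behavioral sketch: the `TaggedMemory` class and its methods are hypothetical, and on real hardware the comparison is performed by the processor on each load and store rather than by library code. The model assumes MTE's 16-byte granules and a 4-bit tag carried in pointer bits 59:56.

```python
GRANULE = 16                      # tag granule size in bytes
TAG_SHIFT = 56                    # pointer tag occupies bits 59:56
TAG_MASK = 0xF                    # 4-bit tags
ADDR_MASK = (1 << TAG_SHIFT) - 1  # model: strip the top byte to get the address


class TagMismatch(Exception):
    """Stands in for the fault raised on a tag check failure."""


class TaggedMemory:
    def __init__(self, size):
        self.data = bytearray(size)
        self.tags = bytearray(size // GRANULE)  # one allocation tag per granule

    def set_tag(self, addr, length, tag):
        """Tag every granule in [addr, addr + length) and return a tagged pointer."""
        first = addr // GRANULE
        last = (addr + length - 1) // GRANULE
        for granule in range(first, last + 1):
            self.tags[granule] = tag & TAG_MASK
        return addr | ((tag & TAG_MASK) << TAG_SHIFT)

    def load(self, ptr):
        """Load one byte, checking the pointer tag against the allocation tag."""
        addr = ptr & ADDR_MASK
        ptr_tag = (ptr >> TAG_SHIFT) & TAG_MASK
        if self.tags[addr // GRANULE] != ptr_tag:
            raise TagMismatch(f"pointer tag {ptr_tag} does not match memory tag")
        return self.data[addr]
```

In this model, a collector that reclaims or relocates a block can call `set_tag` with a different tag, after which any stale pointer still carrying the old tag raises `TagMismatch` on access instead of silently reading reused memory.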

Intel Linear Address Masking (LAM)

Intel's Linear Address Masking (LAM) allows software to store metadata in the upper, untranslated bits of a 64-bit pointer. This feature is particularly beneficial for garbage collectors that use "pointer coloring," a technique in which metadata about an object is encoded directly in references to it. The collector can read an object's state from the pointer itself, without an additional memory lookup, and because the processor ignores the metadata bits during address translation, colored pointers can be dereferenced without first masking those bits off in software.
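Pointer coloring can be sketched with plain integer arithmetic. The example assumes a layout with six metadata bits in pointer bits 62:57, as in LAM's 57-bit mode; the color names `MARKED` and `REMAPPED` and the helper functions are hypothetical. The `strip` function shows the masking step that software must perform before every dereference when the hardware does not ignore the metadata bits.

```python
COLOR_SHIFT = 57  # assumed position of the lowest metadata bit
COLOR_BITS = 6    # assumed number of metadata bits (bits 62:57)
COLOR_MASK = ((1 << COLOR_BITS) - 1) << COLOR_SHIFT

# Hypothetical collector states encoded in the metadata bits.
MARKED = 0b000001
REMAPPED = 0b000010


def color(ptr, bits):
    """Return ptr with its metadata bits replaced by bits."""
    return (ptr & ~COLOR_MASK) | ((bits << COLOR_SHIFT) & COLOR_MASK)


def get_color(ptr):
    """Read the collector state directly from the pointer."""
    return (ptr & COLOR_MASK) >> COLOR_SHIFT


def strip(ptr):
    """Recover the plain address; needed before dereferencing without LAM."""
    return ptr & ~COLOR_MASK
```

For example, `get_color(color(p, MARKED))` yields `MARKED` for any address `p` below bit 57, and `strip` returns the original address unchanged.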

RISC-V extensions

Research into the RISC-V ISA has led to the development of custom extensions like "Graceless," which implements a hardware-based tracing accelerator. This system offloads the traversal of object graphs to a dedicated unit near the memory controller. By processing the heap independently of the main CPU cores, it minimizes cache pollution and significantly reduces the duration of "stop-the-world" events in high-throughput applications.
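The object-graph traversal that a tracing accelerator offloads is, in essence, a worklist-driven reachability search. The sketch below is a generic software version and does not describe the internals of any particular accelerator; the `mark` and `sweep` functions and the dictionary-based heap representation are assumptions for illustration.

```python
from collections import deque


def mark(roots, edges):
    """Return the set of objects reachable from roots.

    edges maps each object id to the ids it references.
    """
    marked = set()
    worklist = deque()
    for root in roots:
        if root not in marked:
            marked.add(root)
            worklist.append(root)
    while worklist:
        obj = worklist.popleft()
        for child in edges.get(obj, ()):
            if child not in marked:  # visit each object once, even in cycles
                marked.add(child)
                worklist.append(child)
    return marked


def sweep(all_objects, marked):
    """Return the unreachable objects, in heap order."""
    return [obj for obj in all_objects if obj not in marked]
```

The inner loop is dominated by dependent memory loads with little computation between them, which is why running it on the main CPU cores tends to displace application data from the caches.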

Advantages and challenges

Hardware assistance can offer predictable operation, higher throughput, and shorter pauses, making it suitable for real-time systems. However, it often requires custom hardware, which has limited its adoption in general-purpose CPUs.
