Fully-Concurrent Sorting GC

The Cangjie language provides the fully-concurrent memory mark sorting GC algorithm as the base of its automatic memory management technology. It features low latency, low memory fragmentation rate, and high memory utilization.

Reducing GC Suspension Time

In some latency-sensitive important scenarios, STW GC or approximate concurrent GC cannot meet technical specifications. For example, for the 120 Hz screen refresh rate (expected to be higher in the future) in mobile scenarios, the total time required for drawing a frame should be less than 8 ms. Millisecond-level GC suspension may become the main latency factor. In a high-concurrency scenario with thousands or even tens of thousands of concurrency, the approximate concurrency algorithm needs to scan thousands of call stacks in a single STW. This operation may prolong the STW time to more than 10 ms.

Compared with the existing STW GC and mostly concurrent GC (see the following figure), Cangjie's fully-concurrent GC abandons the STW as the GC synchronization mechanism and uses a lightweight synchronization mechanism with a shorter latency. The average time required for application threads to complete GC synchronization is less than 100 microseconds, and typically, GC synchronization can be completed within tens of microseconds. stwgc

The following two key elements are used to achieve efficient GC synchronization:

  • Security Point

Concurrent GC needs to process the status synchronization relationship between GC threads and application threads. The security point mechanism is a technical means used by the GC thread to control the application thread to implement GC status synchronization. The security point mechanism consists of two parts. One is the security point check code inserted when the compiler is compiling the Cangjie code, which is usually inserted on an explicit path, such as the function header or tail and loop back edge. The other is the security point synchronization logic implemented in the GC algorithm. When the GC thread needs to synchronize the GC status to a specific application thread, the GC thread first activates the security point check of the application thread. When the security point check code is executed in the application thread, the security point of the application thread is in the active status, and the application thread responds to the GC synchronization request and changes the GC status to a specified status. With the security point mechanism, the GC thread can control the change of the GC status of the application thread and work with the memory barrier to ensure that the concurrent GC can be correctly executed. Different GC statuses require corresponding memory barriers.

  • Memory Barrier

The fully-concurrent GC implemented by the Cangjie language also needs to correctly process three types of data competition relationships: (1) Data competition relationships between GC threads and application threads (2) Data competition relationships between GC threads (3) Data competition relationships imported by the GC between application threads. Because the application threads share some work of the GC. The memory barrier mechanism is used to resolve data competition between GC threads and Cangjie threads. In concurrent GC, the most common data competition is between GC threads and application threads. Example: the time when the GC thread wants to move a live object to a new address and the service thread wants to access the live object almost at the same time. The memory barrier mechanism is used to maintain coordination with a GC thread when an application thread accesses the memory.

Reducing Memory Fragments and Memory Peaks

The memory sorting technology used by the Cangjie GC is one of the most cutting-edge technologies for memory reclamation (certainly, the algorithm implementation complexity of the Cangjie GC is relatively increased). Compared with the existing Mark-Sweep algorithm, the memory sorting technology can be used to migrate scattered live objects to a more compact memory layout, greatly reducing memory fragments. Compared with the replication algorithm that copies all live objects in the from-space to the to-space before the memory of the from-space is released, the memory sorting technology can be used to migrate live objects and release the memory blocks that have been sorted at the same time, eliminating the memory peak caused by GC replication.

In the fully-concurrent GC of the Cangjie language, the Cangjie heap memory is divided into small continuous memory blocks, which are called region. The from-region is a region in the from-space, and the to-region is a region in the to-space. During heap memory sorting, live objects in the from-region are migrated to a proper to-region to form a more compact memory layout. After the migration is complete, the from-region can be released for allocation of new objects. When the memory pressure is high, the from-region and to-region can overlap in the memory space to further reduce the memory peak.

Fast Object Allocation Speed

Thanks to the object migration technology used by the Cangjie GC to reclaim memory, Cangjie runtime implements an object allocation mode based on the bumping-pointer technology, which is an existing advanced memory allocation technology. A memory allocation can be completed in about 10 clock cycles on average.

Optimizing GC Overhead

Cangjie's fully-concurrent memory sorting algorithm uses pointer mark to accelerate the identification of fast paths in memory access and storage operations. Pointer mark classifies statuses of reference members in Cangjie objects into three types: (1) Unmarked. (2) Marked: The reference member is marked in the current GC process. The pointer is called current pointer. (3) Marked: The reference member is marked in the last GC process. The pointer is called old pointer. The unmarked pointer is the real address of a Cangjie object. It can be directly used for memory access and storage, which is a fast path for memory access and storage. The old pointer comes from the last GC. The pointer that has not been repaired needs to be updated after the GC thread or memory barrier queries the forwarding table in the enum and trace phases. The current pointer is marked in the enum and trace phases of the current GC to indicate which objects may be moved. When the memory barrier accesses the marked old pointer, the forwarding table needs to be queried to obtain a new address. When the memory barrier accesses the marked current pointer, it needs to determine whether the object has been transferred based on the current GC status. If the object has been transferred, the forwarding table is queried to obtain a new address. Otherwise, the memory barrier needs to transfer the object to a corresponding to-region and update the marked pointer.

In the Cangjie's fully-concurrent memory sorting algorithm, marked pointers are updated in lazy mode. The marked pointers are updated only when they are accessed by a memory barrier. If the marked pointers are not accessed, they are retained and updated in the enum and trace phases of the next GC. This policy can significantly reduce the duration of the GC process.

Adapting Value Type

As described above, value types are introduced into the Cangjie language to provide developers with better memory and performance development choices, but also make the GC implementation more complex. For reference-only programming languages (such as Java, JavaScript, and Kotlin), all types except basic types are reference types. This mode is more friendly to the GC algorithm. Value types make the GC algorithm more complex in two aspects: (1) The data of a value type may be a member of other data, or may be an independent global or local variable. In the two cases, the data of the value type is in different memory areas. However, for member methods of the value type, a unified behavior definition must be provided in the above two cases. (2) The data of the value type belongs to other value types or reference types in the embedded form. The memory layout is changed with the embedded form. During the implementation of Cangjie GC that supports object migration, the complexity of the GC engineering is greatly increased due to features of value types. Considering that value types bring developers more powerful and convenient expression capabilities, better application performance, and better memory performance, the internal complexity increased in implementation of the Cangjie language is worthwhile.