0.7.1 Release

Thread safety

release

A couple weeks after the 0.7 release I’m excited to announce the first patch release of 0.7.x series.

The theme of this release is a single, yet important, improvement: Thread safety.

In this post I’m going to dive deeper in the architecture of this change.

JDK 21 and Virtual Threads

As usual, after I announced a new release on reddit I got a few positive and inspiring comments and feedbacks, but out of all messages, one single sentence stuck in my head:

The only thing I think you should consider is some switch to turn off thread locals given JDK 21 and virtual threads.

While the suggestion hinted at a switch, some sort of runtime-based mechanism that could decide between thread locals and something else, I think Penna benefits more from being simple and straightforward.

Why is this a problem?

Let’s take a step back and think critically about ThreadLocal and how virtual threads affect Penna. If we read the JEP 429, specially the section which mentions the Problems with thread-local variables, we can get a hint that, at least for Penna’s use-case, ThreadLocal can be an issue both by it’s Unbounded lifetime and Expensive inheritance characteristics.

Imagine this scenario: an app running Penna is now using Virtual Threads to handle the requests. For Penna, before, that meant that each Virtual Thread would trigger a cascade of allocations in Penna, as both Penna’s LoggingEventBuilder implementation and the code responsible for writing the JSON log message were attached to a ThreadLocal variable.

I can discuss the rationale for that in a separate post, but for now, it suffices to say that it was a pretty fair bet that it wouldn’t blow client application.

New technology, new challenges

Ok, so we’re dropping the ThreadLocal. How do we start?

We can do a “TDD-ish” approach here. Is it really a problem to run Penna on a virtual thread? JMH to the rescue…

@Benchmark
public void penna(PennaState state) throws InterruptedException {
    CountDownLatch latch = new CountDownLatch(state.threads);

    for (int i = 0; i < state.threads; i++) {
        int index = i;
         Thread.ofVirtual()
                .name("jmh-penna-", i)
                .start(() -> {
                    try {
                        var logger = state.getLogger("jmh.penna.task" + index);
                        logger.info("Message 1");
                        logger.info("Message 2");
                        logger.info("Message 3");
                        logger.info("Message 4");
                    } finally {
                        latch.countDown();
                    }
                });
    }

    latch.await();
}

What do I want to prove with this test? Well, if it breaks, than of course I have to fix the code, but also it gives me a chance to measure how much memory, GC and throughput Penna has in a close to real-world scenario: a virtual thread will acquire some logger - possibly a new one? - and will write a few messages before shutting down.

First attempt, with the existing code and… It failed miserably (no surprise, Sherlock).

With a single thread only, it was filling the heap size for the test (8gb) quickly.

So the first measure was to reduce the buffer size from 32kb to 2kb. Now it doesn’t die out of memory gluttony, but instead we reach a few barriers:

Logger acquisition can sometimes fail during the runs;
For a test run with 16 threads, not only it performs poorly, but it also runs indefinitely, never finishing the test run.

So, in a virtual thread environment, we were paying the cost of doing a significant number of allocations. The runs were reporting about 4k ops/sec for one thread only 60 ops/sec for run with 16 virtual threads.

First shot

The two main components in this setup are the LoggingEventBuilder and the Sink. The first because it contains a reuse object - cheaper than allocating a new object at every log call - which is cleaned and re-populated with the new logging information. The second because it contains the buffer where the json is written to before dumping it into stdout.

So, it feels natural to group then together. In fact, we can have an object pool in which a tuple-like structure can hold a reuse LoggingEvent and a Sink:

`1`	`public record LogUnitContext(Sink sink, PennaLogEvent logEvent) {}`

And an object pool for this object sounds like a trivial thing to build, right?

// We update the `LogUnitContext` to get a refence to the object pool
public record LogUnitContext(LogUnitContextPool pool, int selfReference, Sink sink, PennaLogEvent logEvent) {
    public void releaseBackToPool() {
        pool.release(selfReference);
    }
}

// And we create it in the object pool
public final class LogUnitContextPool {

    private final LogUnitContext[] objectGroup;
    private final BitSet allocations;

    private LogUnitContext leafObject(int index) {
        return new LogUnitContext(this, index, SinkManager.Instance.get(), new PennaLogEvent());
    }

    public LogUnitContextPool() {
        int size = 16;
        objectGroup = new LogUnitContext[size];
        for (int i = 0; i < size; i++) {
            objectGroup[i] = leafObject(i);
        }
    }

    public void release(int index) {
        allocations.clear(index);
    }

    public LogUnitContext get() {
        var index = allocations.nextClearBit(0);
        allocations.set(index);
        return objectGroup[index];
    }
}

So far, so good. Except, of course, we need to make this thread safe, otherwise two concurrent threads can update the allocations BitSet and they can get the same object at the same time. Also, we need to ensure we don’t go over our object pool size limit.

Synchronizing the method here wouldn’t help either as we need to lock on both acquiring and releasing the object from/to the pool.

Then, if we think about it, if we add a lock around the allocations BitSet, in the event many threads want to acquire a LogUnitContext object, we’ll make them wait for no reason.

Granular control

The final solution looks something like this:

public final class LogUnitContextPool {

    private final Lock[] locks;
    private final LogUnitContext[] objectGroup;

    private LogUnitContext leafObject(int index) {
        return new LogUnitContext(this, index, SinkManager.Instance.get(), new PennaLogEvent());
    }

    public LogUnitContextPool() {
        int size = 16;
        objectGroup = new LogUnitContext[size];
        locks = new ReentrantLock[size];
        for (int i = 0; i < size; i++) {
            objectGroup[i] = leafObject(i);
            locks[i] = new ReentrantLock();
        }
    }

    public void release(int index) {
        locks[index].unlock();
    }

    private int acquireLock() {
        var index = 0;
        while (!locks[index].tryLock()) {
            index = ++index & 0xF;
        }
        return index;
    }

    public LogUnitContext get() {
        var index = acquireLock();
        return objectGroup[index];
    }
}

Considering the fact that both arrays are the same size, we can now have better control of our contention as tryLock will only return true once we’re able to get hold of an object in the pool, which in turn will only return to us an index that this thread is able to hold. Releasing an object now doesn’t hold or affect acquiring an object. The layer above will, of course, need to make sure this object is released after the logs is written - or even if it fails to write.

This shows up in the JMH test, as that test is now able to run for 1, 16 and 256 threads without issues:

Benchmark                                                     (threads)   Mode  Cnt         Score       Error      Units
    CompleteUsecasePerformanceTest.penna                                  1  thrpt    3     48389.452 ± 20980.696      ops/s
    CompleteUsecasePerformanceTest.penna:dTLB-stores:u                    1  thrpt           5990.455                   #/op
    CompleteUsecasePerformanceTest.penna:gc.alloc.rate                    1  thrpt    3        61.279 ±    26.503     MB/sec
    CompleteUsecasePerformanceTest.penna:gc.alloc.rate.norm               1  thrpt    3      1327.991 ±     0.647       B/op
    CompleteUsecasePerformanceTest.penna:gc.count                         1  thrpt    3         6.000                 counts
    CompleteUsecasePerformanceTest.penna:gc.time                          1  thrpt    3         4.000                     ms
    CompleteUsecasePerformanceTest.penna                                 16  thrpt    3      7974.001 ±  1657.137      ops/s
    CompleteUsecasePerformanceTest.penna:gc.alloc.rate                   16  thrpt    3       152.502 ±    31.521     MB/sec
    CompleteUsecasePerformanceTest.penna:gc.alloc.rate.norm              16  thrpt    3     20054.613 ±    41.570       B/op
    CompleteUsecasePerformanceTest.penna:gc.count                        16  thrpt    3        15.000                 counts
    CompleteUsecasePerformanceTest.penna:gc.time                         16  thrpt    3         9.000                     ms
    CompleteUsecasePerformanceTest.penna                                256  thrpt    3       692.376 ±  4654.997      ops/s
    CompleteUsecasePerformanceTest.penna:gc.alloc.rate                  256  thrpt    3       310.801 ±  2105.448     MB/sec
    CompleteUsecasePerformanceTest.penna:gc.alloc.rate.norm             256  thrpt    3    470272.697 ± 32649.345       B/op
    CompleteUsecasePerformanceTest.penna:gc.count                       256  thrpt    3        31.000                 counts
    CompleteUsecasePerformanceTest.penna:gc.time                        256  thrpt    3        22.000                     ms

For a single thread it shows a 10x improvement over the ThreadLocal version and for 16 threads it is more than 133x faster. The numbers, however, start to go down because of contention further down the chain, when the JVM tries to offload the buffers into stdout.

What’s next?

Penna 0.7.1 is now released with this change and I hope to get more feedback soon. There are a ton of things that can be improved, but this makes Penna 0.7.1 effectively ready for JDK 21.

One of the things I experimented with but didn’t commit to change yet is how the lock is acquired. By always starting at 0 we don’t benefit from the ring buffer that much. Starting off from a random position showed to be better for a single thread (or rather, while stdout wasn’t disputed by other threads), so it could be a way forward. Another alternative would be to use an atomic integer and store the values within runs, but this might add more overhead than necessary.

A bonus, hidden ThreadLocal issue “fixed”

Another ThreadLocal instance that was touched upon, but unfortunately still not removes was the MDC storage. That is by design, as MDC data is inherently thread-local.

However, when looking at the flamegraph for those tests, a wild ThreadLocal popped out, even when MDC was not set:

JFR showing MDC taking time unnecessarily

So that was also re-engineered to avoid creating new objects. This nice little thig was made possible by using a sum type from a sealed interface:

public sealed interface Mdc extends MDCAdapter {

    final class Inner {
        private static final EmptyMdc empty = new EmptyMdc();
        public static final ThreadLocal<Mdc> mdcStorage = ThreadLocal.withInitial(() -> empty);

    }

    final class EmptyMdc implements Mdc {
        @Override
        public void put(String key, String val) {
            Inner.mdcStorage.set(new MdcStorage(key, val));
        }

        // Other methods
    }

    final class MdcStorage implements Mdc {
        @Override
        public void clear() {
            Inner.mdcStorage.set(Inner.empty);
        }

        // Other methods
    }
}

With this setup, the EmptyMdc instance in Mdc.Inner becomes a singleton object that, upon modification request (i.e. put as shown above), will update the ThreadLocal with a new instance of MdcStorage, contianing the new key-value pair. The storage, upon clear request, can also replace itself with the singleton proxy object.

This doesn’t solve needing the ThreadLocal yet, but makes it cheaper by not creating unnecessary objects inside the ThreadLocal.

JFR showing MDC still taking time unnecessarily, but without object creation

A much better approach for JDK 21 would be to use ScopedValues, as proposed by JEP 429 and passing the values in the fluent API’s .addKeyValue() method:

class SomeClass {
    final static ScopedValue<String> REQUEST_ID = ScopedValue.newInstance();
    final Logger logger = LoggerFactory.getLogger(SomeClass.class);

    void doSomething() {
        logger.atInfo()
            .addKeyValue("requestId", REQUEST_ID.get())
            .log("Doing something");
    }
}

Final words

This is a bit different than the previous posts as it went on deeper into the approach Penna adopted to replace ThreadLocal and be prepared for JDK 21. I’d really appreciate some feedback, questions or comments.

Penna is shaping up greatly at each release and your feedback is a huge part of its improvement. Don’t refrain from opening an issue, starting a discussion or sending me a message!

Stay tuned for more!