Realizing the software of the future...today

Serialization API: A Simple Reference Architecture for Object Serialization

In part 1 of this series, I defined the desirable attributes for a serialization framework as performant, lean, and high quality. In part 2, I explored the available option space in Java, diving deep into Java IO, Gson, Jackson, Johnzon, and Kryo to analyze their leanness and quality. During that analysis, I discovered a common architecture: Gson, Jackson, Johnzon, and Kryo all combine a FACADE with a STRATEGY to implement serialization. In this part, I seek to formalize that reference architecture with a common API. I then use this API to implement approaches analogous both to Java IO and to the studied libraries. In a subsequent piece I will compare the performance of both approaches with the available options and determine the viability of the approach as a replacement for general usage.

Designing with design in mind

The Problem of Serializable

The first question one might ask when designing a serialization framework is “how do users mark objects for processing by the framework?” This might be a more controversial question than it appears. Java IO uses a MARKER INTERFACE. Kryo requires registration through its top-level FACADE. Gson, Jackson, and Johnzon don’t require any special steps at all, handling almost any Plain Old Java Object. That’s the easiest option for end users. So isn’t that approach the best?

Nothing at all?

I don’t think so. One of the convincing arguments that emerges from both Joshua Bloch’s treatment of serialization in Effective Java and the points raised by Marks and Goetz in “What we hate about Serialization and what we might do about it” is that designing serialized objects is different from designing other types of objects. Special consideration has to be given to the serialized form as part of the API of such objects, which should be documented and managed differently.

While it is not necessarily a framework's job to write code for its users (though some might disagree), a helpful framework applies certain restrictions that make it “easy to do the right thing.” Treating all objects as potentially serializable fails in this regard. It leaves it to the programmer to figure out that serialized objects require special care, which potentially does them a disservice in the long run. A framework can help by enforcing some type of “opt-in” demarcation for these special types of objects.

The best Serializable objects are simply a special form of structures that carry data to an external system, in this case persistence to disk or across the network. Martin Fowler formalized the differences between DOMAIN OBJECT and VALUE OBJECT/DATA TRANSFER OBJECT in Patterns of Enterprise Application Architecture, while Eric Evans describes the difference between ENTITIES and VALUE OBJECTS in Domain Driven Design. Yet the only construct commonly used to realize these in code is a Class.

Essentially, the Java language has been missing the ability to distinguish classes whose function is to manage runtime system resources, like java.util.concurrent.ExecutorService and java.nio.file.FileSystem, from data carriers such as a user-defined Person with fields such as name, birthdate, and other metadata. One could argue that this distinction has been solved with the introduction of Records, which, much like structs in C#, map cleanly to VALUE OBJECT.

Simply mandating the use of Records is one potential approach a serialization framework could employ, though again it would imply that all Records should be serializable. Even Java IO doesn’t go that far. In discussing “Simpler object and data serialization using Java records,” Boes and Hegarty note that Records still require the use of the java.io.Serializable MARKER INTERFACE.

Marker Interface? More powerful interface? Abstract class?

Yet as Bloch pointed out in Effective Java, the MARKER INTERFACE isn’t even used for any type constraining in java.io.ObjectOutputStream. Arguably a new serialization framework could follow the MARKER INTERFACE approach and provide the compile-time checking that Java IO originally missed. Could it do more?

Serialization essentially transforms an object to bytes with a specified form. A modern Serializable interface could incorporate a method such as byte[] toBytes() and return the serialized form. The problem is that deserialization would require a method like static <T> T fromBytes(byte[] bytes). Java interfaces cannot force implementers to have a STATIC FACTORY METHOD. As an instance method on the interface, this would require the instantiation of the Serializable object in an incomplete state before calling, which breaks immutability and is ugly.
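
The asymmetry can be sketched in a few lines. The ByteSerializable interface and the Name class below are hypothetical names invented for this illustration, not part of any framework. The instance method can be enforced by the compiler, while the static factory is only a convention.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical interface: toBytes() can be required of every implementer.
interface ByteSerializable {
    byte[] toBytes();
    // There is no way to declare here that implementers must also provide
    // "static T fromBytes(byte[])". A static method on the interface would
    // belong to the interface itself and could not be supplied per type.
}

// Hypothetical immutable data carrier that satisfies the interface.
final class Name implements ByteSerializable {
    private final String value;

    Name(String value) {
        this.value = value;
    }

    String value() {
        return value;
    }

    @Override
    public byte[] toBytes() {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // By convention only: the compiler never checks that this method exists.
    static Name fromBytes(byte[] bytes) {
        return new Name(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Calling Name.fromBytes(new Name("Ada").toBytes()) rebuilds an equal value without ever exposing a half-initialized instance, but nothing stops another implementer from omitting the factory.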

Arguably, making a modern Serializable an abstract class that defines a constructor taking a byte[] argument, which subclasses must then call, would work. The drawback is that Java allows only one superclass, so Serializable classes could not extend anything else. This would preclude Records and Enums, which already extend java.lang.Record and java.lang.Enum implicitly, from participating in the framework. While a MARKER INTERFACE does provide compile-time type checking that a framework author can use to surface usage errors and nudge developers towards thinking about serializable objects differently, a compromise approach that is similar to the style of Gson/Jackson/Johnzon would use a MARKER ANNOTATION.

Marker Annotation?

A MARKER ANNOTATION allows for the same runtime check as happens in Java IO while providing similar flexibility to Gson/Jackson/Johnzon. Additionally, Marker Annotations do not force inheritance in a way a MARKER INTERFACE does. If an object is annotated as @Serializable and another object extends it, the extending object does not necessarily have to inherit the annotation if we do not define it as such.
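
A minimal sketch of such an annotation and its runtime check follows. It assumes the annotation is retained at run time and is deliberately not meta-annotated with @Inherited. Person, Employee, and MarkerCheck are hypothetical names used only for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Assumed shape of the marker annotation. RUNTIME retention makes it visible
// to reflection; leaving out @Inherited means subclasses are not marked
// implicitly.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Serializable {}

// Hypothetical class that opts in.
@Serializable
class Person {
    final String name;

    Person(String name) {
        this.name = name;
    }
}

// Extends an annotated class but does not opt in itself.
class Employee extends Person {
    Employee(String name) {
        super(name);
    }
}

final class MarkerCheck {
    // Runtime opt-in check, playing the role of Java IO's instanceof test.
    static boolean isMarked(Class<?> type) {
        return type.isAnnotationPresent(Serializable.class);
    }
}
```

Under these assumptions MarkerCheck.isMarked(Person.class) is true while MarkerCheck.isMarked(Employee.class) is false, so each subclass has to opt in explicitly.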

Finally, Marker Annotations tie in well with Java Annotation Processing. Annotation Processing is an interesting and often under-used feature of the Java Programming Language that enables developers to generate code at compile time. For serialization, this can obviate the need for reflection. Translating an object into XML/JSON/YAML by a strategy such as “include the field’s name and its value in the output string” is completely determinable during annotation processing. Yet every library I inspected uses Reflection with some sort of field scraping as the default mechanism for serialization. Reflection is widely considered to be slow, though a lot of compiler and platform work has gone into the JVM to speed it up, and perceptions of speed should always be tested instead of assumed. But if Google’s construction of Dagger for Dependency Injection, after making Guice as an alternative to Spring, is any indicator, the benefits of annotation processing in producing faster and simpler code are worth exploring.
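
To make the contrast concrete, here is a rough sketch of the “field name and value” strategy done two ways. This is not an actual annotation processor. GeneratedPersonWriter is hand-written here and only stands in for the kind of class a processor could emit, and Person, ReflectiveWriter, and the output format are all assumptions made for the example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;
import java.util.StringJoiner;

// Hypothetical data carrier used only for this illustration.
final class Person {
    final String name;
    final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Reflection-based field scraping: names and values are discovered at run time.
final class ReflectiveWriter {
    static String write(Object value) {
        Field[] fields = value.getClass().getDeclaredFields();
        // Sort by name so the output does not depend on reflection order.
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        StringJoiner out = new StringJoiner(",", "{", "}");
        for (Field field : fields) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue; // skip constants and compiler-generated fields
            }
            try {
                field.setAccessible(true);
                out.add(field.getName() + "=" + field.get(value));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return out.toString();
    }
}

// Stand-in for generated code: the same output, with every field access
// resolved at compile time and no reflection involved.
final class GeneratedPersonWriter {
    static String write(Person person) {
        return "{age=" + person.age + ",name=" + person.name + "}";
    }
}
```

Both writers produce the same string for the same Person. The difference is when the field list is determined, which is exactly the work a processor could move to compile time.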

Decision

Given the flexibility of Marker Annotations and their potential usage in code generation, I decided to go with a Marker Annotation. Implementing STRATEGY is fairly straightforward. Leveraging Java Generics allows for a bit more type safety, though it did make me second guess my decision not to use a MARKER INTERFACE which could’ve been used as a wildcard bound. The framework can be represented simply in a UML diagram and the following Java code.

Class Diagram for Serialization API
The Serialization API, consisting of a @Serializable annotation, a class T marked with the annotation, and a SerializationStrategy<T> interface.
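
A minimal sketch of what the two API elements might look like in Java. The method names serialize and deserialize are assumptions, since only the type names appear above, and Point and PointStrategy are hypothetical examples of a marked class and its strategy.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.nio.ByteBuffer;

// The opt-in marker (retention and target are assumptions).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Serializable {}

// The STRATEGY: one implementation per serializable type.
// Method names are assumed for this sketch.
interface SerializationStrategy<T> {
    byte[] serialize(T value);

    T deserialize(byte[] bytes);
}

// Hypothetical class T marked with the annotation.
@Serializable
final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

// Hypothetical strategy writing the two ints as eight big-endian bytes.
final class PointStrategy implements SerializationStrategy<Point> {
    @Override
    public byte[] serialize(Point point) {
        return ByteBuffer.allocate(2 * Integer.BYTES)
                .putInt(point.x)
                .putInt(point.y)
                .array();
    }

    @Override
    public Point deserialize(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new Point(buffer.getInt(), buffer.getInt());
    }
}
```

Note that the type parameter T is unbounded here. With a MARKER INTERFACE it could have been declared as T extends Serializable, which is the trade-off mentioned in the decision above.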

Implementing the Design

Java IO and Gson/Jackson/Johnzon/Kryo offer two different implementation approaches. Naively, both seem viable, though implementation experience may show why the FACADE/STRATEGY approach has come to be the dominant paradigm. For the sake of proving the generalizability of the Serialization API, I created an implementation following each approach.

Java IO

Java IO implements extension classes of java.io.OutputStream/java.io.InputStream specially crafted for objects: java.io.ObjectOutputStream/java.io.ObjectInputStream. They read and write using void writeObject(Object o)/Object readObject(). This approach complects serialization with writing the result to an underlying stream. Such a violation of the Single Responsibility Principle is arguably less flexible and composable than a simple functional transformation, though composition is achievable through DECORATOR as in the original IO API design.

java.io.ObjectOutputStream/java.io.ObjectInputStream are both rather large classes of 2493/4134 lines of code each (57/42 documented public methods). They clearly show responsibilities which a modern design can delegate to the SerializationStrategy. A minimal modern implementation in this style simply contains a map of types to their corresponding strategy and dispatches accordingly. Populating the strategies can be done in the constructor, either through arguments or using Java’s service provider interface, java.util.ServiceLoader. A class diagram of the implemented solution is shown below.

A Java Streams based implementation
A Java IO inspired implementation. AnnotationBasedObjectInputStream<T> and AnnotationBasedObjectOutputStream<T> use a shared SerializationProtocol<T>.
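
The map-and-dispatch idea can be sketched as follows. This is an illustration under assumptions rather than the implementation in the diagram: the class names StrategyObjectOutputStream and StrategyObjectInputStream, the strategy method names, and the length prefix written before each object are all invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

interface SerializationStrategy<T> {
    byte[] serialize(T value);
    T deserialize(byte[] bytes);
}

final class Utf8StringStrategy implements SerializationStrategy<String> {
    public byte[] serialize(String value) { return value.getBytes(StandardCharsets.UTF_8); }
    public String deserialize(byte[] bytes) { return new String(bytes, StandardCharsets.UTF_8); }
}

// Decorates any OutputStream and dispatches each object to the strategy
// registered for its class.
final class StrategyObjectOutputStream extends DataOutputStream {
    private final Map<Class<?>, SerializationStrategy<?>> strategies;

    StrategyObjectOutputStream(OutputStream out, Map<Class<?>, SerializationStrategy<?>> strategies) {
        super(out);
        this.strategies = Map.copyOf(strategies);
    }

    @SuppressWarnings("unchecked")
    <T> void writeObject(T value) throws IOException {
        SerializationStrategy<T> strategy = (SerializationStrategy<T>) strategies.get(value.getClass());
        if (strategy == null) {
            throw new IOException("No strategy for " + value.getClass().getName());
        }
        byte[] bytes = strategy.serialize(value);
        writeInt(bytes.length); // length prefix so the reader knows where the object ends
        write(bytes);
    }
}

// The reading side needs the expected type to pick a strategy.
final class StrategyObjectInputStream extends DataInputStream {
    private final Map<Class<?>, SerializationStrategy<?>> strategies;

    StrategyObjectInputStream(InputStream in, Map<Class<?>, SerializationStrategy<?>> strategies) {
        super(in);
        this.strategies = Map.copyOf(strategies);
    }

    <T> T readObject(Class<T> type) throws IOException {
        SerializationStrategy<?> strategy = strategies.get(type);
        if (strategy == null) {
            throw new IOException("No strategy for " + type.getName());
        }
        byte[] bytes = new byte[readInt()];
        readFully(bytes);
        return type.cast(strategy.deserialize(bytes));
    }
}

final class StreamDemo {
    // Writes one string to an in-memory stream and reads it back.
    static String roundTrip(String text) {
        Map<Class<?>, SerializationStrategy<?>> strategies = Map.of(String.class, new Utf8StringStrategy());
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (StrategyObjectOutputStream out = new StrategyObjectOutputStream(sink, strategies)) {
            out.writeObject(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] written = sink.toByteArray();
        try (StrategyObjectInputStream in = new StrategyObjectInputStream(new ByteArrayInputStream(written), strategies)) {
            return in.readObject(String.class);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because both classes are ordinary stream decorators, they compose with buffering, compression, or file streams in the usual Java IO way. The strategies could equally be discovered through java.util.ServiceLoader instead of being passed to the constructor.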

FACADE/STRATEGY

This library is actually even easier to write. It’s not really necessary to copy all of the methods or match the different configuration options offered by Gson/Jackson/Johnzon/Kryo. Each reflects the design particulars of its model, which may not be generalizable. The essential piece is that the FACADE delegates to an appropriate STRATEGY for each type it receives. This may tempt an implementer to jump straight to a java.util.Map or a java.util.concurrent.ConcurrentHashMap, but it’s worth taking a step back and analyzing the precise need.

A Map is a data structure designed to associate keys with values and to provide fast lookup of a value by its key (O(1) in many cases). The java.util.Map interface provides a lot more than that functionality, however. Applying the Interface Segregation Principle, one can see that the bare minimum the FACADE needs is akin to a java.util.Map#get for a SerializationStrategy given a Type.

Using a java.util.function.Function<Type,Optional<SerializationStrategy<?>>> provides space for an even faster solution to be used. Deep within the guts of Kryo, I discovered the use of Fibonacci Hashing in a pre-allocated array to offer a significant performance advantage over a Map lookup. It’s fairly easy to replicate that approach or any other I might want by keeping the dependency as minimal and close to the “for a type, return its serialization strategy if there is one” specification as possible.
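
A minimal sketch of a FACADE that depends only on such a lookup function. StrategyFacade, its method names, and the lookupFrom helper are hypothetical, and the Map-backed lookup shown is just one possible implementation of the function, not the faster array-based one described above.

```java
import java.lang.reflect.Type;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

interface SerializationStrategy<T> {
    byte[] serialize(T value);
    T deserialize(byte[] bytes);
}

final class Utf8StringStrategy implements SerializationStrategy<String> {
    public byte[] serialize(String value) { return value.getBytes(StandardCharsets.UTF_8); }
    public String deserialize(byte[] bytes) { return new String(bytes, StandardCharsets.UTF_8); }
}

// Hypothetical FACADE: its only dependency is
// "for a type, return its serialization strategy if there is one".
final class StrategyFacade {
    private final Function<Type, Optional<SerializationStrategy<?>>> lookup;

    StrategyFacade(Function<Type, Optional<SerializationStrategy<?>>> lookup) {
        this.lookup = lookup;
    }

    @SuppressWarnings("unchecked")
    <T> byte[] serialize(T value) {
        SerializationStrategy<T> strategy = (SerializationStrategy<T>) find(value.getClass());
        return strategy.serialize(value);
    }

    <T> T deserialize(byte[] bytes, Class<T> type) {
        return type.cast(find(type).deserialize(bytes));
    }

    private SerializationStrategy<?> find(Type type) {
        return lookup.apply(type)
                .orElseThrow(() -> new IllegalArgumentException("No strategy for " + type.getTypeName()));
    }

    // One possible lookup: adapt an immutable copy of a Map. Any other
    // structure can be swapped in without touching the FACADE.
    static Function<Type, Optional<SerializationStrategy<?>>> lookupFrom(Map<Type, SerializationStrategy<?>> strategies) {
        Map<Type, SerializationStrategy<?>> copy = Map.copyOf(strategies);
        return type -> Optional.ofNullable(copy.get(type));
    }
}
```

A facade built with StrategyFacade.lookupFrom(Map.of(String.class, new Utf8StringStrategy())) round-trips strings and rejects any type without a registered strategy. Replacing the Map with a pre-allocated array only means supplying a different function.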

Another interesting question is whether the framework should apply any sort of delimiting or framing to the data. In string based encodings like JSON and XML, the format itself provides conventions for start and end delimiters. In a binary format, like what Kryo uses, it’s possible the stream of bytes to deserialize could be corrupted or contain multiple objects. A simple content boundary can help the framework detect data corruption and allow for multiple objects to be stored and processed from a single file.
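
One way such a content boundary might look is sketched below. The frame layout (start marker, length, payload, CRC-32 checksum) and every name in the Framing class are assumptions chosen for illustration, not the format used by Kryo or by the library described here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

final class Framing {
    // Hypothetical start-of-frame marker ("FRAM" in ASCII).
    private static final int MARKER = 0x4652414D;

    // Frame layout: marker (4 bytes), length (4), payload, CRC-32 (8).
    static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(MARKER);
        out.writeInt(payload.length);
        out.write(payload);
        out.writeLong(checksum(payload));
    }

    static byte[] readFrame(DataInputStream in) throws IOException {
        if (in.readInt() != MARKER) {
            throw new IOException("Corrupt frame: bad start marker");
        }
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Corrupt frame: negative length");
        }
        byte[] payload = new byte[length];
        in.readFully(payload); // throws EOFException if the frame is truncated
        if (in.readLong() != checksum(payload)) {
            throw new IOException("Corrupt frame: checksum mismatch");
        }
        return payload;
    }

    private static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    // Convenience helpers for in-memory use: several objects in one byte array.
    static byte[] frameAll(byte[]... payloads) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            for (byte[] payload : payloads) {
                writeFrame(out, payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    static List<byte[]> readAll(byte[] data) {
        List<byte[]> frames = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                frames.add(readFrame(in));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return frames;
    }
}
```

The length prefix is what lets several objects share one file, while the marker and checksum give the framework a cheap way to notice that the bytes it was handed are not what was written.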

I implemented the simplest possible version of such a library with the following UML diagram.

A FACADE/STRATEGY based implementation. Loial is the FACADE, which is loaded with a Function<T, Optional<SerializationStrategy<T>>> that can dispatch an object T to its SerializationStrategy.

Is this generalizable?

Can it be used as a “serialize4j?”

Yes. Although such a short answer may sound glib, this reference architecture is the bare minimum set of components useful for a serialization framework. Should “bare minimum” be a goal? This is a philosophical question, but I will defend it.

Frameworks have traditionally been built by an “everything and the kitchen sink” approach. This often drives programmers away from them as developers gain experience and comfort in a domain. Efforts such as OSGi and Project Jigsaw, along with a wide variety of projects in the Java world, have moved from bigger artifacts to smaller ones, with modules providing performance through smaller runtimes, and maintainability and security benefits through better encapsulation. This phenomenon is predicted rather well by Robert Martin in Agile Software Development when discussing the Stable Abstractions Principle and Distance from the Main Sequence. Unstable dependencies create problems and hotspots for change, while stable dependencies can form the basis for communication and interoperability. Hence packages that have been “too large” are being refactored towards better modularity and a greater separation of interface from implementation, which has been a part of programming best practice since C header files and implementation files. It’s very difficult to define a more abstract and stable contract in Java than an annotation and an interface with no behavior.

Using any existing serialization library with this approach is trivial for any experienced software engineer. It’s so straightforward it’s not worth a full code example. Simply implement SerializationStrategy<Object> and delegate to the framework object in the body. This is essentially what Robert Martin calls Clean Boundaries in Clean Code or Eric Evans refers to as an Anti-Corruption Layer in Domain Driven Design.

Can it be used outside of Java?

Any language that provides abstractions similar to Java annotations and interfaces should be able to implement this architecture. Languages that don’t quite map can adapt. For instance, Ruby doesn’t have built-in annotations, but mixins are similar to Java interfaces. Future work may look at implementing this architecture in other languages and comparing it with their solutions.

Trial By Fire

After implementing the reference architecture with two different approaches, I was pleased to find it amenable to each with minimal fuss. In order to answer the question of “is it worth it?”, I need to analyze the results with the same methodology I used for existing options. This includes an evaluation of the leanness, quality, and performance of these implementations. I focus on that in subsequent work.