… and some notes about its hash code.

String is one of the fundamental classes of Java. In 2014 (JDK 8u20), String instances consumed about 25% of the heap. java.lang.String is a common hero of the average Java application, so it is worth getting to know this hero better.

This article consists of four sections:

Antiquity – if you enjoy archaeology – HotSpot (Java HotSpot Virtual Machine) 1.0.2 – 1.4
Middle Ages – HotSpot 5, OpenJDK 6 – 8
Modern History – OpenJDK 9 – 16
Contemporary History – OpenJDK 17 – 25

Feel free to scroll to the parts that interest you.

Antiquity

HotSpot 1.0.2

Version 1.0.2 was the first stable release of Java. Initially, the java.lang.String class contained the following fields:

public final class String {
    /** The value is used for character storage. */
    private char value[];

    /** The offset is the first index of the storage that is used. */
    private int offset;

    /** The count is the number of characters in the String. */
    private int count;
}

Comment

As we can see, the value is stored in an array of char – the 2-byte primitive. Initially, Java used UCS-2 encoding, a Unicode-based 2-byte fixed-width encoding.

The offset and count fields make it possible to share the same char array across multiple String instances, especially when using the substring() method.
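To make the sharing concrete, here is a minimal sketch of the idea. SharedString is a hypothetical toy class written for this article, not the JDK source; it only shows how offset and count let a substring reuse the parent's array.

```java
// Toy model of the old layout; SharedString is a hypothetical class, not JDK code.
final class SharedString {
    private final char[] value; // storage, possibly shared with other instances
    private final int offset;   // first used index in value
    private final int count;    // number of characters in this string

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    /** Returns a view of [begin, end) that reuses the same array: no characters are copied. */
    SharedString substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new StringIndexOutOfBoundsException();
        }
        return new SharedString(value, offset + begin, end - begin);
    }

    /** True when both instances point at the very same char array. */
    boolean sharesStorageWith(SharedString other) {
        return value == other.value;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        SharedString whole = new SharedString("Hello, World");
        SharedString part = whole.substring(7, 12);
        System.out.println(part);                          // World
        System.out.println(part.sharesStorageWith(whole)); // true
    }
}
```

Creating the substring costs one small object allocation, regardless of how many characters it covers.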

Hashing

In version 1.0.2, the hashCode() method computed the hash based on only a few sampled characters of the String, at least for longer strings. The time complexity of this method was therefore effectively O(1), which kept operations on Hashtable (the predecessor of HashMap) cheap. However, because so few characters affected the hash, the probability of a hash collision was high.
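The sampling idea can be sketched as follows. This is a reconstruction from memory of the commonly cited early algorithm, so treat the threshold of 16 and the multipliers 37 and 39 as assumptions rather than verified source; SampledHash is a hypothetical class name.

```java
// Sketch of a sampling hash; constants are assumptions, not verified against the original sources.
final class SampledHash {
    /** Uses every character for short inputs and only a sample of characters for long ones. */
    static int sampledHash(String s) {
        int h = 0;
        int len = s.length();
        if (len < 16) {
            for (int i = 0; i < len; i++) {
                h = h * 37 + s.charAt(i);
            }
        } else {
            int skip = len / 8; // between 8 and 16 samples, regardless of length
            for (int i = 0; i < len; i += skip) {
                h = h * 39 + s.charAt(i);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Both strings have 33 characters, so skip is 4 and only indices 0, 4, 8, ... are read.
        // They differ at index 27, which is never sampled.
        String a = "http://example.com/page-0001.html";
        String b = "http://example.com/page-0002.html";
        System.out.println(sampledHash(a) == sampledHash(b)); // true
    }
}
```

Keys that share a long common prefix, such as URLs or file paths, are exactly the kind of input that collides under this scheme.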

HotSpot 1.1

There were no changes to the fields compared to the previous version.

HotSpot 1.2

public final class String implements java.io.Serializable, Comparable {
    /** The value is used for character storage. */
    private char value[];

    /** The offset is the first index of the storage that is used. */
    private int offset;

    /** The count is the number of characters in the String. */
    private int count;

    /** If non-zero, cached hash code for this string. */
    private transient int hash;
}

The String class gained one more field (hash) and implemented one more interface (Comparable).

Hashing

The implementation of the hashCode() method changed: all characters are now involved in the calculation, so the probability of a hash collision dropped. However, the computational complexity became O(n), which is why the result is cached in the new hash field.
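The formula documented for String.hashCode() is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], computed with int arithmetic. A minimal sketch of it follows; PolynomialHash is a hypothetical helper, not the JDK implementation.

```java
// Sketch of the documented polynomial hash; PolynomialHash is a hypothetical helper class.
final class PolynomialHash {
    /** Computes s[0]*31^(n-1) + ... + s[n-1] with int overflow, visiting every character once. */
    static int hash(CharSequence s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("hello") == "hello".hashCode()); // true
        System.out.println(hash("Aa") == hash("BB"));            // true: collisions still exist
    }
}
```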

HotSpot 1.3

public final
class String implements java.io.Serializable, Comparable {
    /** The value is used for character storage. */
    private char value[];

    /** The offset is the first index of the storage that is used. */
    private int offset;

    /** The count is the number of characters in the String. */
    private int count;

    /** Cache the hash code for the string */
    private int hash = 0;
}

HotSpot 1.4

No changes.

Middle Ages

HotSpot 5.0

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence
{
    /** The value is used for character storage. */
    private final char value[];

    /** The offset is the first index of the storage that is used. */
    private final int offset;

    /** The count is the number of characters in the String. */
    private final int count;

    /** Cache the hash code for the string */
    private int hash; // Default to 0
}

The main change we can notice is that the value, offset and count fields are marked as final.

The question is: why weren’t they final before? This remains a mystery, but the fact is that the final keyword was rarely used in earlier versions of HotSpot.

JSR-204

From Java 5.0, String began using UTF-16 instead of UCS-2. This means that some characters are encoded in 4 bytes, so two chars (a surrogate pair) are sometimes needed to represent a single character. As a result, a char no longer represents a single character. An int does, and that value is called a code point.
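A short illustration of the difference between chars and code points, using only standard String methods. CodePoints is a hypothetical class name.

```java
// Illustration of chars vs. code points; CodePoints is a hypothetical class name.
final class CodePoints {
    /** Returns the code points of s, one int per character regardless of surrogate pairs. */
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which does not fit in a single char.
        String s = "a\uD83D\uDE00";
        System.out.println(s.length());                               // 3 chars
        System.out.println(s.codePointCount(0, s.length()));          // 2 code points
        System.out.println(Integer.toHexString(codePointsOf(s)[1]));  // 1f600
    }
}
```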

OpenJDK 6

No changes.

OpenJDK 7

No changes.

OpenJDK 8 – removed count and offset

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {
    /** The value is used for character storage. */
    private final char value[];

    /** Cache the hash code for the string */
    private int hash; // Default to 0
}

After almost 20 years, it turned out that sharing the value array also has drawbacks.

Prior to OpenJDK 8, when the substring() method was invoked, the value array was shared and only the offset and count were stored in the new String. This was a kind of optimisation: allocating a new array was avoided, which reduced CPU work and memory usage.

However, this approach has its performance drawbacks. In contrast to a normal String allocation, where the value array is located next to the String object, the array behind the result of substring() could be stored in a different segment of memory.

It had an even bigger disadvantage: it could lead to a form of memory leak. For example, if a large XML document is stored in a String and only a small substring of it is used as a key in a HashMap, the garbage collector can reclaim the original String object, but not the underlying array, since it is shared with our little key in the HashMap. This could be worked around by invoking the String constructor explicitly, but the default behavior could be surprising.

That’s why, starting with OpenJDK 8, the count and offset fields were removed (the change first shipped in a JDK 7 update, 7u6). Now, each substring() call creates a copy of the selected characters.

Modern History

OpenJDK 9 – compact Strings, `@Stable`

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {

    /**
     * The value is used for character storage.
     *
     * @implNote This field is trusted by the VM, and is a subject to
     * constant folding if String instance is constant. Overwriting this
     * field after construction will cause problems.
     *
     * Additionally, it is marked with {@link Stable} to trust the contents
     * of the array. No other facility in JDK provides this functionality (yet).
     * {@link Stable} is safe here, because value is never null.
     */
    @Stable
    private final byte[] value;

    /**
     * The identifier of the encoding used to encode the bytes in
     * {@code value}. The supported values in this implementation are
     *
     * LATIN1
     * UTF16
     *
     * @implNote This field is trusted by the VM, and is a subject to
     * constant folding if String instance is constant. Overwriting this
     * field after construction will cause problems.
     */
    private final byte coder;

    /** Cache the hash code for the string */
    private int hash; // Default to 0
}

`coder`

After 20 years, it turned out that 2 (or 4) bytes per character is a little too much. Research indicated that the majority of text can be represented with single-byte characters encoded with ISO-8859-1 (Latin-1). Therefore, a new field (coder) was introduced, and the char array value was replaced with a byte array. A String whose characters all fit in Latin-1 uses one byte per character, and any other String uses two bytes per char, as before.

See JEP 254 for more details.
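The encoding decision can be sketched roughly like this. CompactEncoding is a hypothetical class, and the big-endian byte order used for the two-byte case is an arbitrary choice made for this sketch, not a statement about what the JDK does internally.

```java
// Sketch of the compact-strings idea; CompactEncoding is hypothetical, not JDK code.
final class CompactEncoding {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    /** Picks LATIN1 when every char fits in one byte, otherwise UTF16. */
    static byte chooseCoder(char[] chars) {
        for (char c : chars) {
            if (c > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    /** Encodes with one byte per char when possible, two bytes per char otherwise. */
    static byte[] encode(char[] chars) {
        if (chooseCoder(chars) == LATIN1) {
            byte[] out = new byte[chars.length];
            for (int i = 0; i < chars.length; i++) {
                out[i] = (byte) chars[i];
            }
            return out;
        }
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >> 8); // high byte first: an assumption of this sketch
            out[2 * i + 1] = (byte) chars[i];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode("hello".toCharArray()).length);                      // 5
        System.out.println(encode("za\u017C\u00F3\u0142\u0107".toCharArray()).length); // 12
    }
}
```

Note that a single character outside Latin-1 is enough to switch the whole String to two bytes per char.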

`@Stable`

The final keyword alone does not guarantee that a field will never change: it can still be overwritten, for example via reflection. Such a guarantee, however, would enable some performance optimizations.

The @Stable annotation tells the VM that once a field holds a value other than its default (null, 0 or false), it can be treated as a constant. For arrays, @Stable additionally means that each element is stable too (constant once it holds a non-default value).

The @Stable annotation enables constant-folding and could be helpful for other JIT optimizations.

OpenJDK 12 – `Constable`, `ConstantDesc`

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence,
               Constable, ConstantDesc {

    /**
     * The value is used for character storage.
     *
     * @implNote This field is trusted by the VM, and is a subject to
     * constant folding if String instance is constant. Overwriting this
     * field after construction will cause problems.
     *
     * Additionally, it is marked with {@link Stable} to trust the contents
     * of the array. No other facility in JDK provides this functionality (yet).
     * {@link Stable} is safe here, because value is never null.
     */
    @Stable
    private final byte[] value;

    /**
     * The identifier of the encoding used to encode the bytes in
     * {@code value}. The supported values in this implementation are
     *
     * LATIN1
     * UTF16
     *
     * @implNote This field is trusted by the VM, and is a subject to
     * constant folding if String instance is constant. Overwriting this
     * field after construction will cause problems.
     */
    private final byte coder;

    /** Cache the hash code for the string */
    private int hash; // Default to 0
}

Both Constable and ConstantDesc were introduced mainly for compilers, tools, and JVM languages. Their purpose is to provide metadata describing constants.
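For String, both interfaces are trivial, because a String is its own nominal descriptor. A small sketch using the public API available since JDK 12 follows; StringAsConstant is a hypothetical class name.

```java
import java.lang.invoke.MethodHandles;
import java.util.Optional;

// Shows that a String describes itself as a constant; StringAsConstant is a hypothetical name.
final class StringAsConstant {
    /** True when both the descriptor and the resolved constant are the very same instance. */
    static boolean describesItself(String s) {
        Optional<String> descriptor = s.describeConstable();
        String resolved = s.resolveConstantDesc(MethodHandles.lookup());
        return descriptor.isPresent() && descriptor.get() == s && resolved == s;
    }

    public static void main(String[] args) {
        System.out.println(describesItself("hello")); // true
    }
}
```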

They likely have little to no direct impact on performance.

OpenJDK 13 – `hashIsZero`

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence,
               Constable, ConstantDesc {

    /**
     * The value is used for character storage.
     *
     * @implNote This field is trusted by the VM, and is a subject to
     * constant folding if String instance is constant. Overwriting this
     * field after construction will cause problems.
     *
     * Additionally, it is marked with {@link Stable} to trust the contents
     * of the array. No other facility in JDK provides this functionality (yet).
     * {@link Stable} is safe here, because value is never null.
     */
    @Stable
    private final byte[] value;

    /**
     * The identifier of the encoding used to encode the bytes in
     * {@code value}. The supported values in this implementation are
     *
     * LATIN1
     * UTF16
     *
     * @implNote This field is trusted by the VM, and is a subject to
     * constant folding if String instance is constant. Overwriting this
     * field after construction will cause problems.
     */
    private final byte coder;

    /** Cache the hash code for the string */
    private int hash; // Default to 0

    /**
     * Cache if the hash has been calculated as actually being zero, enabling
     * us to avoid recalculating this.
     */
    private boolean hashIsZero; // Default to false;
}

After almost 25 years, the edge case of zero hash codes was finally solved. Previously, when a String had a hash of 0, the cached value was indistinguishable from an uninitialized hash. So the hash field was useless as a cache for zero-hashed Strings: their hash was recomputed on every call.

The solution was to introduce a new field, hashIsZero, whose value is set lazily, on the first invocation of the hashCode() method. As the most common memory alignment is 8 bytes, the object memory footprint remained the same (24 bytes per String object).
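The caching logic can be sketched with a toy class. CachedHash and its computations counter are hypothetical and exist only for demonstration. The example string "f5a5a608" is not from the JDK sources, it is simply a string whose hash code happens to be 0.

```java
// Toy model of the two-field hash cache; CachedHash is hypothetical, not the JDK source.
final class CachedHash {
    private final String text;
    private int hash;           // 0 means "not computed yet" or "computed and equal to 0"
    private boolean hashIsZero; // set when the computed hash really is 0
    int computations;           // demonstration only: counts full recomputations

    CachedHash(String text) {
        this.text = text;
    }

    int cachedHashCode() {
        int h = hash;
        if (h == 0 && !hashIsZero) {
            computations++;
            for (int i = 0; i < text.length(); i++) {
                h = 31 * h + text.charAt(i);
            }
            if (h == 0) {
                hashIsZero = true; // remember the zero result without touching hash
            } else {
                hash = h;
            }
        }
        return h;
    }

    public static void main(String[] args) {
        CachedHash zero = new CachedHash("f5a5a608"); // a string whose hash code is 0
        zero.cachedHashCode();
        zero.cachedHashCode();
        System.out.println(zero.computations); // 1
    }
}
```

Without the hashIsZero check, the counter in this sketch would grow by one on every call for a zero-hashed string.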

Although this may seem like a rare edge case, such strings are very easy to generate. This weakness can be used to mount a DoS attack. It’s simple: just send a lot of headers whose names are zero-hashed Strings, for example to a server that handles requests through the Java EE interfaces.
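One way to see how easy generation is: for the polynomial hash, hash(a + b) equals hash(a) * 31^length(b) + hash(b), so concatenating zero-hashed strings gives another zero-hashed string. The sketch below relies on that property. ZeroHashStrings is a hypothetical class, and the seed "f5a5a608" is just one example of a string whose hash code is 0.

```java
// Generates distinct strings that all hash to 0; ZeroHashStrings is a hypothetical class.
final class ZeroHashStrings {
    private static final String SEED = "f5a5a608"; // example seed with hashCode() == 0

    /** Returns the seed repeated the given number of times; every result hashes to 0. */
    static String zeroHashed(int copies) {
        return SEED.repeat(copies);
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 3; i++) {
            String s = zeroHashed(i);
            System.out.println(s + " -> " + s.hashCode());
        }
    }
}
```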

For more details about zero-hashed Strings visit here.

Contemporary History

OpenJDK 17

No changes.

OpenJDK 21 – vectorized `hashCode()`

JDK 21 introduced a new performance enhancement for hash computation. Previously, the hash code was computed in a simple loop over each byte of the value array. However, it turned out that part of the computation can be executed in parallel. This approach was implemented in JDK 21 in the jdk.internal.util.ArraysSupport::vectorizedHashCode() method, which is marked with the jdk.internal.vm.annotation.IntrinsicCandidate annotation. Once the JIT compiles String.hashCode(), it starts using a specialized implementation of ArraysSupport::vectorizedHashCode() that is optimized at the machine-code level and uses SIMD instructions. The results vary depending on the CPU executing the code, but on my old 2019 Intel-based MacBook, hashCode() on JDK 21 was eight times faster than on JDK 20.
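Why can a sequential-looking loop be parallelized at all? Four steps of h = 31 * h + x can be folded into a single step in which every element is multiplied by its own power of 31, and those multiplications do not depend on each other. The sketch below shows only this algebra in plain Java. It is not the intrinsic itself, and UnrolledHash is a hypothetical class.

```java
import java.nio.charset.StandardCharsets;

// Illustrates the algebra behind vectorized hashing; UnrolledHash is hypothetical, not JDK code.
final class UnrolledHash {
    /** Same result as the simple loop, but processes four bytes per step. */
    static int hash(byte[] a) {
        int h = 0;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            // 923521 = 31^4, 29791 = 31^3, 961 = 31^2; the four products are independent.
            h = 923521 * h
                    + 29791 * (a[i] & 0xff)
                    + 961 * (a[i + 1] & 0xff)
                    + 31 * (a[i + 2] & 0xff)
                    + (a[i + 3] & 0xff);
        }
        for (; i < a.length; i++) { // leftover elements, one at a time
            h = 31 * h + (a[i] & 0xff);
        }
        return h;
    }

    /** Convenience wrapper for text that fits in Latin-1. */
    static int hashLatin1(String s) {
        return hash(s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String s = "The quick brown fox jumps over the lazy dog";
        System.out.println(hashLatin1(s) == s.hashCode()); // true
    }
}
```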

As computing a String’s hash code is a very common operation, upgrading to JDK 21 appears to be low-hanging performance fruit.

OpenJDK 25

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence,
               Constable, ConstantDesc {

    /**
     * The value is used for character storage.
     *
     * @implNote This field is trusted by the VM, and is a subject to
     * constant folding if String instance is constant. Overwriting this
     * field after construction will cause problems.
     *
     * Additionally, it is marked with {@link Stable} to trust the contents
     * of the array. No other facility in JDK provides this functionality (yet).
     * {@link Stable} is safe here, because value is never null.
     */
    @Stable
    private final byte[] value;

    /**
     * The identifier of the encoding used to encode the bytes in
     * {@code value}. The supported values in this implementation are
     *
     * LATIN1
     * UTF16
     *
     * @implNote This field is trusted by the VM, and is a subject to
     * constant folding if String instance is constant. Overwriting this
     * field after construction will cause problems.
     */
    private final byte coder;

    /** Cache the hash code for the string */
    @Stable
    private int hash; // Default to 0

    /**
     * Cache if the hash has been calculated as actually being zero, enabling
     * us to avoid recalculating this. This field is _not_ annotated @Stable as
     * the `hashCode()` method reads the field `hash` first anyhow and if `hash`
     * is the default zero value, is not trusted.
     */
    private boolean hashIsZero; // Default to false;
}

The story is similar to the value field: since the hash code of a String never changes once computed, the VM can assume that the value of this field does not change over time. However, to benefit from this change, the String needs to be held by another object that also benefits from @Stable, e.g. a Map obtained from a Map.of(stringKey, value) invocation.

How much is this optimization worth? In the Inside Java article, the specific micro-scale operation became 8 times faster.

Future

Lilliput

The main goal of the Lilliput project is to shrink the object header from 96 to 64 bits. Compact object headers first appeared as an experimental feature in JDK 24 (JEP 450) and became a product feature in JDK 25 with JEP 519, but they are still not enabled by default.

Surprisingly, this does not affect String itself. Due to the memory layout of objects, the object size is rounded up to a multiple of 8 bytes. For a String, the memory needed is the header (8 B with Lilliput, 12 B without) plus the fields (assuming compressed oops: 4 B for the pointer to the value array + 1 B for coder + 4 B for hash + 1 B for hashIsZero), which rounded up to a multiple of 8 gives 24 B for each String object.

However, it may affect the array held by the value field, so the future is still bright 😉 There are also plans to shrink the object header even further, to 32 bits, and that will affect Strings.
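The arithmetic above can be checked with a few lines of code. This is a simplified model that assumes compressed oops and ignores any padding between fields. StringObjectSize is a hypothetical helper, and the 4-byte header is only the planned future case mentioned above.

```java
// Back-of-the-envelope object size model; StringObjectSize is a hypothetical helper.
final class StringObjectSize {
    // value reference (compressed oops) + coder + hash + hashIsZero
    static final int FIELD_BYTES = 4 + 1 + 4 + 1;

    /** Rounds size up to the next multiple of alignment. */
    static int alignUp(int size, int alignment) {
        return (size + alignment - 1) / alignment * alignment;
    }

    /** Size of a String object (without its array) for a given header size in bytes. */
    static int stringSize(int headerBytes) {
        return alignUp(headerBytes + FIELD_BYTES, 8);
    }

    public static void main(String[] args) {
        System.out.println(stringSize(12)); // 24: 96-bit header
        System.out.println(stringSize(8));  // 24: 64-bit header, no gain yet
        System.out.println(stringSize(4));  // 16: 32-bit header would finally shrink it
    }
}
```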

Valhalla

The main goal of the Valhalla project is to introduce value objects. These objects will be immutable, value-based and without identity.

One would expect that String will also become a value object in the future. However, there are some challenges: for example, Java currently has no frozen arrays, and not all String fields are final. (All the challenges are listed in the stackexchange thread, which I recommend reading.)

Jakub Gardo
