
How to Use Java 26 Vector API for SIMD Performance

Problem

I have a numerical computation that processes millions of values. My CPU supports SIMD (Single Instruction, Multiple Data), but standard Java gives me no way to use it explicitly. As written, the loop handles one value at a time.

Here’s my scalar code:

ScalarComputation.java
// Process one value at a time
void scalarComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i++) {
        c[i] = -(a[i] * a[i] + b[i] * b[i]);
    }
}

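My CPU has AVX2 instructions that can process 8 floats at once, but nothing in this code asks for them. The JIT compiler may auto-vectorize a simple loop like this one, but I can’t rely on it.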

Environment

  • Java 26 (non-LTS)
  • x86_64 CPU with AVX2 support
  • Numerical processing workloads

What Is Vector API?

The Vector API (JEP 529) is in its eleventh round of incubation in Java 26. It lets you express SIMD operations in pure Java:

Scalar vs SIMD Processing
Scalar (1 value at a time):
┌───┐
│ a │ → multiply ─┐
└───┘             ├─→ add → negate → ┌───┐
┌───┐             │                  │ c │
│ b │ → multiply ─┘                  └───┘
└───┘
SIMD (8 values at once):
┌─────────────────────────┐
│ a0 a1 a2 a3 a4 a5 a6 a7 │ → multiply ─┐
└─────────────────────────┘             ├─→ add → negate → ┌─────────────────────────┐
┌─────────────────────────┐             │                  │ c0 c1 c2 c3 c4 c5 c6 c7 │
│ b0 b1 b2 b3 b4 b5 b6 b7 │ → multiply ─┘                  └─────────────────────────┘
└─────────────────────────┘

Same operations, but each instruction handles 8 floats instead of 1, so the theoretical throughput is up to 8x.

How to Use Vector API

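First, add the incubator module. It is not resolved by default, so the same --add-modules flag is needed for javac when compiling and for java when running: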

Run with Vector API
java --add-modules jdk.incubator.vector -jar myapp.jar

Here’s the vectorized version:

VectorComputation.java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorComputation {
    // Define vector size (256-bit = 8 floats)
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

    // Assumes a, b and c all have the same length
    static void vectorComputation(float[] a, float[] b, float[] c) {
        int i = 0;
        // Largest multiple of SPECIES.length() that is <= a.length
        int upperBound = SPECIES.loopBound(a.length);
        // Process 8 floats at a time
        for (; i < upperBound; i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            var vc = va.mul(va).add(vb.mul(vb)).neg();
            vc.intoArray(c, i);
        }
        // Handle remaining elements (not divisible by 8)
        for (; i < a.length; i++) {
            c[i] = -(a[i] * a[i] + b[i] * b[i]);
        }
    }
}

Key concepts:

  • VectorSpecies defines the vector width (256-bit = 8 floats)
  • FloatVector.fromArray() loads 8 floats into a vector register
  • Operations like mul(), add(), neg() work on the entire vector
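  • loopBound() rounds the array length down to a multiple of the vector length, so the main loop only touches full vectors and the scalar loop picks up the tail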
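The scalar tail loop is not the only way to deal with leftover elements. Here is a minimal sketch of an alternative, assuming all three arrays have the same length: a mask from indexInRange() switches off the lanes that would fall past the end of the array, so one loop covers everything. The class name MaskedVectorComputation is my own, and I use SPECIES_PREFERRED here instead of SPECIES_256 so the JVM picks the width.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical class name; a variation on VectorComputation above.
class MaskedVectorComputation {
    // Let the JVM choose the widest float species for the current CPU
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Assumes a, b and c all have the same length
    static void compute(float[] a, float[] b, float[] c) {
        int n = a.length;
        for (int i = 0; i < n; i += SPECIES.length()) {
            // Lanes whose array index would be >= n are switched off
            VectorMask<Float> m = SPECIES.indexInRange(i, n);
            FloatVector va = FloatVector.fromArray(SPECIES, a, i, m);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i, m);
            // Only the active lanes are written back
            va.mul(va).add(vb.mul(vb)).neg().intoArray(c, i, m);
        }
    }
}
```

Whether this beats the main-loop-plus-scalar-tail version depends on how well the CPU supports masked loads and stores, so I would measure both before choosing.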

What Happens Under the Hood

The JVM compiles vector operations to CPU-specific instructions:

Compilation Flow
Java Vector Code
JIT Compiler
├─→ x86_64 with AVX2 → vmulps, vaddps, vxorps (sign-bit flip for neg)
├─→ x86_64 with AVX-512 → 16 floats at once
└─→ ARM with NEON → 4 floats at once
Same Java code, optimal native instructions

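No JNI. No native code. The JVM handles platform differences automatically. One caveat: the widths above apply when the species is FloatVector.SPECIES_PREFERRED. With a fixed SPECIES_256, the code keeps using 8-float vectors even on AVX-512 hardware, and on CPUs without 256-bit vectors it may fall back to slower code.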
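To see which width the JVM actually picks on a given machine, you can query the species at run time. This is a small sketch; the class name SpeciesInfo and the helper preferredFloatLanes() are my own, and the printed values depend on the CPU and JVM, so I don’t show any output.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical helper class for inspecting vector shapes.
class SpeciesInfo {
    // Number of float lanes in the species the JVM prefers on this machine
    static int preferredFloatLanes() {
        return FloatVector.SPECIES_PREFERRED.length();
    }

    public static void main(String[] args) {
        VectorSpecies<Float> preferred = FloatVector.SPECIES_PREFERRED;
        System.out.println("Preferred: " + preferred.vectorBitSize()
                + " bits, " + preferred.length() + " float lanes");
        // A fixed species always reports the same lane count
        System.out.println("SPECIES_256: "
                + FloatVector.SPECIES_256.length() + " float lanes");
    }
}
```

Run it with --add-modules jdk.incubator.vector, like any other code that uses the API.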

Why Eleventh Incubator?

Eleven iterations might seem excessive. But SIMD is tricky:

  • Different CPUs have different vector widths
  • Edge cases with alignment and overflow
  • Interaction with garbage collection
  • API ergonomics take time to get right

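Each incubation round refines the API based on developer feedback. The JEP also states that the API will keep incubating until the Project Valhalla features it depends on are available as preview features. So the API is close to final, but still evolving.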

When to Use Vector API

Good use cases:

  • Image/video processing (pixel operations)
  • Machine learning inference
  • Scientific computing
  • Financial calculations
  • Cryptographic operations
  • Audio processing
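Many of these workloads boil down to reductions such as a dot product. Here is a minimal sketch of one, assuming both arrays have the same length. The class name DotProduct is my own. Each lane keeps its own partial sum, and reduceLanes() adds the lanes together at the end.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example class; not part of the code above.
class DotProduct {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Assumes x and y have the same length
    static float dot(float[] x, float[] y) {
        int i = 0;
        int upperBound = SPECIES.loopBound(x.length);
        // One partial sum per lane
        FloatVector acc = FloatVector.zero(SPECIES);
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector vx = FloatVector.fromArray(SPECIES, x, i);
            FloatVector vy = FloatVector.fromArray(SPECIES, y, i);
            acc = acc.add(vx.mul(vy));
        }
        // Collapse the per-lane partial sums into a single float
        float sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for the leftover elements
        for (; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }
}
```

Because the additions happen in a different order than in a plain loop, the result can differ from the scalar version in the last few bits for general floating-point input.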

Not worth it:

  • Small arrays (overhead exceeds benefit)
  • Branch-heavy code (vectors don’t help)
  • Memory-bound operations (CPU waits on RAM)
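The branch-heavy caveat is about data-dependent control flow. A simple per-element condition is a milder case, and it can often be written with a mask instead of a branch. Here is a sketch that replaces negative values with zero. The class and method names are my own, and I haven’t benchmarked it.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example class; not part of the code above.
class MaskedSelect {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // out[i] = x[i] < 0 ? 0 : x[i]; assumes out is at least as long as x
    static void clampNegativesToZero(float[] x, float[] out) {
        int i = 0;
        int upperBound = SPECIES.loopBound(x.length);
        FloatVector zero = FloatVector.zero(SPECIES);
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, x, i);
            // One boolean per lane: true where the value is negative
            VectorMask<Float> negative = v.compare(VectorOperators.LT, 0f);
            // Take the lane from zero where the mask is set, else keep v
            v.blend(zero, negative).intoArray(out, i);
        }
        // Scalar tail with the same condition
        for (; i < x.length; i++) {
            out[i] = x[i] < 0 ? 0f : x[i];
        }
    }
}
```

Both sides of the condition are computed for every lane, so this only pays off when the two sides are cheap.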

Performance Expectations

Real-world gains vary:

Typical Speedups
Operation Type         Speedup vs Scalar
─────────────────────────────────────────
Simple arithmetic      4-8x
Complex expressions    3-6x
Memory-bound           1.5-2x
Small arrays           ~1x (no benefit)

Your mileage depends on CPU, data size, and operation complexity.
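The only way to know what you get on your own machine is to measure. Here is a rough timing sketch in plain Java. QuickBench, Kernel and bestNanos() are names I made up for illustration, and the warm-up and run counts are arbitrary. For numbers you want to trust, a proper harness such as JMH is the better tool.

```java
// Hypothetical quick-and-dirty timer; not a substitute for JMH.
class QuickBench {
    @FunctionalInterface
    interface Kernel {
        void run(float[] a, float[] b, float[] c);
    }

    // The scalar kernel from the Problem section
    static void scalar(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = -(a[i] * a[i] + b[i] * b[i]);
        }
    }

    // Runs the kernel warmup times untimed, then returns the best of runs timings in ns
    static long bestNanos(Kernel k, float[] a, float[] b, float[] c, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            k.run(a, b, c); // give the JIT a chance to compile the kernel
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            k.run(a, b, c);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        float[] a = new float[n];
        float[] b = new float[n];
        float[] c = new float[n];
        for (int i = 0; i < n; i++) {
            a[i] = i % 100;
            b[i] = i % 7;
        }
        long scalarNs = bestNanos(QuickBench::scalar, a, b, c, 200, 50);
        System.out.println("scalar best: " + scalarNs / 1_000 + " us");
        // To compare, pass VectorComputation::vectorComputation as the kernel
    }
}
```

Try several array sizes. A size that fits in cache and one that doesn’t can give very different ratios.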

Summary

In this post, I showed how to use Java 26’s Vector API for SIMD operations. The key points are:

  • Vector API lets you write vectorized code in pure Java
  • Same code compiles to optimal instructions on different CPUs
  • Use --add-modules jdk.incubator.vector to enable it
  • Best for large arrays with arithmetic operations
  • Eleventh incubator means the API is mature but still evolving

The Vector API brings high-performance computing to Java without native code. When it exits incubator status, expect it to become a standard tool for numerical workloads.
