Ozaki Scheme -- OpenCL

High-precision GEMM on OpenCL devices via mantissa slicing (Scheme 1) or the Chinese Remainder Theorem (CRT, Scheme 2). Both schemes decompose floating-point matrices into int8/u8 tiles and use DPAS/XMX matrix engines when available. This is an OpenCL adaptation of the CPU-based Ozaki sample in LIBXS.
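To make the mantissa-slicing idea concrete, here is a minimal host-side sketch in Python. It is not the kernel code of this sample. The 7-bit slice width, the shared per-vector exponent, and the names `slice_vector` and `sliced_dot` are assumptions chosen for illustration. Each value is split into small signed digits that fit an int8, and the dot product is rebuilt from integer dot products of slice pairs.

```python
import math

SLICE_BITS = 7  # assumption: each int8 slice carries 7 mantissa bits


def slice_vector(x, nslices):
    """Split floats into nslices digit vectors (|digit| <= 127) plus a shared exponent."""
    # Shared exponent e with |v| < 2**e for every element.
    emax = max((math.frexp(v)[1] for v in x if v != 0.0), default=0)
    rem = [math.ldexp(v, -emax) for v in x]  # exact scaling, |rem| < 1
    slices = []
    for _ in range(nslices):
        digits = []
        for i, r in enumerate(rem):
            scaled = math.ldexp(r, SLICE_BITS)  # |scaled| < 128
            d = int(scaled)                     # truncate toward zero
            digits.append(d)
            rem[i] = scaled - d                 # exact remainder, |rem| < 1
        slices.append(digits)
    return emax, slices


def sliced_dot(a, b, nslices=8):
    """Dot product rebuilt from integer dot products of all slice pairs."""
    ea, sa = slice_vector(a, nslices)
    eb, sb = slice_vector(b, nslices)
    total = 0  # exact Python integer; a device would use int32 accumulators per pair
    for i in range(nslices):
        for j in range(nslices):
            idot = sum(p * q for p, q in zip(sa[i], sb[j]))
            # Pair (i, j) has weight 2**(-SLICE_BITS * (i + j + 2)); keep it integral.
            total += idot << (SLICE_BITS * (2 * nslices - 2 - i - j))
    return math.ldexp(float(total), ea + eb - 2 * SLICE_BITS * nslices)
```

For example, `sliced_dot([1.5, -2.25, 3.0], [4.0, 0.5, -1.0])` evaluates to 1.875. Elements much smaller than the largest one in a vector lose their lowest bits, which is why the number of slices controls accuracy.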

Build

cd samples/ozaki
make [GNU=1] [DBG=1]

Requires an OpenCL runtime and headers. BLAS is linked via BLAS=2 for the reference GEMM.

Run

./ozaki.x [M [N [K [transa [transb [alpha [beta [lda [ldb [ldc]]]]]]]]]]

All arguments are positional and optional:

Pos.  Argument  Default  Description
1     M         257      Rows of C and op(A)
2     N         M        Columns of C and op(B)
3     K         M        Inner dimension
4     transa    0        0=N, 1=T for A
5     transb    0        0=N, 1=T for B
6     alpha     1        Scalar multiplier for A*B
7     beta      1        Scalar multiplier for C
8     lda       auto     Leading dimension of A
9     ldb       auto     Leading dimension of B
10    ldc       M        Leading dimension of C
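
The defaults in the table chain off M, which the following sketch spells out. The helper `resolve_args` is hypothetical and not part of the sample. The rule used for the "auto" leading dimensions (smallest valid column-major value) is an assumption based on the usual BLAS convention, not something this README states.

```python
def resolve_args(argv):
    """Resolve ./ozaki.x positional arguments using the documented defaults."""

    def arg(pos, default, conv=int):
        # pos is 1-based, matching the "Pos." column of the table.
        return conv(argv[pos - 1]) if len(argv) >= pos else default

    m = arg(1, 257)
    n = arg(2, m)
    k = arg(3, m)
    transa = arg(4, 0)
    transb = arg(5, 0)
    alpha = arg(6, 1.0, float)
    beta = arg(7, 1.0, float)
    # ASSUMPTION: "auto" picks the smallest valid column-major leading dimension.
    lda = arg(8, k if transa else m)
    ldb = arg(9, n if transb else k)
    ldc = arg(10, m)
    return dict(M=m, N=n, K=k, transa=transa, transb=transb,
                alpha=alpha, beta=beta, lda=lda, ldb=ldb, ldc=ldc)
```

Calling `resolve_args(["100", "50", "20", "1"])` corresponds to `./ozaki.x 100 50 20 1`, that is, a transposed A with all remaining arguments left at their defaults.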

Environment Variables

Scheme Selection

Variable  Default  Description
OZAKI     2        1=mantissa slicing, 2=CRT (default), 3=adaptive, 0=bypass BLAS
OZAKI_FP  64       64=fp64 (double), 32=fp32 (float)
OZAKI_N   (auto)   Slices (Sch.1: fp64=8, fp32=4) or primes (Sch.2: fp64=16, fp32=9)

OZAKI=3 (adaptive) starts with Scheme 1 on the first call to learn the effective cutoff from preprocessing occupancy data. Subsequent calls compare the Scheme-1 pair count against the Scheme-2 prime count and pick the cheaper path. The cutoff is cached alongside the preprocessed buffers and reused on cache hits without any device-to-host readback.
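
The comparison made by the adaptive mode can be pictured with the sketch below. It rests on assumptions that this README does not confirm: a slice pair (i, j) is taken to be computed only when i + j <= cutoff, the triangular mode is taken to keep pairs with i <= j, and ties are resolved in favor of Scheme 2. The function names are hypothetical.

```python
def scheme1_pairs(nslices, cutoff, triangular=True):
    """Count slice pairs (i, j) that survive the cutoff (assumed rule: i + j <= cutoff)."""
    return sum(1
               for i in range(nslices)
               for j in range(i if triangular else 0, nslices)
               if i + j <= cutoff)


def pick_scheme(nslices, cutoff, nprimes):
    """Return 1 or 2, whichever needs fewer integer GEMMs (ties go to Scheme 2)."""
    return 1 if scheme1_pairs(nslices, cutoff) < nprimes else 2
```

With the fp64 defaults (8 slices, 16 primes) and no effective cutoff, Scheme 1 needs 36 triangular pairs and Scheme 2 wins. With a cutoff of 4, only 9 pairs remain and Scheme 1 becomes the cheaper path.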

Accuracy

Variable      Default  Description
OZAKI_FLAGS   3        Sch.1 bitmask: 1=Triangular, 2=Symmetrize, 0=full S^2. Not used by Sch.2
OZAKI_TRIM    0        Precision levels to trim (0=exact). ~7 bits (Sch.1) or ~4 bits (Sch.2) per level
OZAKI_I8      0        Sch.2: use signed i8 residues (moduli<=128) instead of u8
OZAKI_GROUPS  0        Sch.2: K-grouping factor; consecutive K panels share reconstruction

Hardware Control

Variable          Default  Description
OZAKI_RTM         (auto)   Register tiling M (power of two). Auto: 2 (HIER), 4 (256-GRF)
OZAKI_RTN         (auto)   Register tiling N (power of two). Auto: 2 (Intel GPU), 1 (other)
OZAKI_WG          0        Work-group size hint (0=no hint)
OZAKI_SG          (auto)   Sub-group size (forced to 16 with XMX)
OZAKI_BIGGRF      (auto)   Override 256-GRF detection (0=off, 1=on). HIER defaults to 128
OZAKI_KU          2        K-loop unroll factor
OZAKI_RC          8        DPAS repeat count (8 or 4)
OZAKI_PB          1        Sch.2: CRT prime batching factor
OZAKI_HIER        (auto)   Sch.2: hierarchical CRT (default on). Two-level Garner reconstruction
OZAKI_PREFETCH    0        Sch.1: enable prefetching
OZAKI_SCALAR_ACC  0        Sch.1: force scalar accumulation

Memory and Caching

Variable       Default  Description
OZAKI_DEVPOOL  0        Device memory pool via USM/SVM (eliminates per-call alloc overhead)
OZAKI_CACHE    0        Preprocessing cache bitmask: 1=A, 2=B, 3=both. Skips preprocessing on match

The preprocessing cache also stores the last effective cutoff from Scheme 1 occupancy detection. On cache hits the cutoff is reused without device-to-host readback, eliminating the sync bubble.
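
A minimal sketch of how such a cache could behave is shown below. The class `PreprocCache`, the one-entry-per-operand layout, and the match key are all hypothetical. Only the bit values (1=A, 2=B) and the fact that the cutoff is stored with the preprocessed data come from this README.

```python
CACHE_A, CACHE_B = 1, 2  # bit values from the OZAKI_CACHE description


class PreprocCache:
    """One cached entry per operand, gated by the OZAKI_CACHE bitmask."""

    def __init__(self, mask):
        self.mask = mask
        self.entries = {}  # operand bit -> (key, preprocessed data, cutoff)

    def lookup(self, bit, key, preprocess):
        """Return (data, cutoff, hit). preprocess() is only called on a miss."""
        enabled = bool(self.mask & bit)
        if enabled:
            entry = self.entries.get(bit)
            if entry is not None and entry[0] == key:
                # Hit: the cutoff comes from the cache, no readback is needed.
                return entry[1], entry[2], True
        data, cutoff = preprocess()
        if enabled:
            self.entries[bit] = (key, data, cutoff)
        return data, cutoff, False
```

In this sketch, `key` stands for whatever identifies an unchanged operand, for instance a tuple of buffer address, shape, and leading dimension. The real matching criterion is not described here.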

Benchmark

Variable       Default  Description
NREPEAT        1        Number of benchmark repetitions
OZAKI_VERBOSE  0        0=silent, 1=errors, 2=warnings, 3+=all. Negative=all

Additional variables for profiling, accuracy monitoring, and complex GEMM dispatch (OZAKI_PROFILE, OZAKI_THRESHOLD, OZAKI_STAT, OZAKI_EPS, OZAKI_RSQ, OZAKI_EXIT, OZAKI_COMPLEX) are handled by the LIBXS Ozaki sample, which owns the GEMM interceptor. See its README for details.

Kernel Registry

Scheme 1 fused GEMM kernels are compiled on demand via a JIT registry. The compile-time cutoff (OZAKI_CUTOFF) is baked into each kernel specialization, allowing the compiler to eliminate dead slice-pair iterations and reduce register pressure. The first call with a given cutoff value triggers JIT compilation (~100 ms); subsequent calls hit the registry cache. Typical workloads produce 2-3 specializations (full cutoff, reduced cutoff, each with/without bounds checking).
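
The registry pattern described above can be sketched as a dictionary keyed by specialization. This is an illustration only. `KernelRegistry` and the `-DOZAKI_BOUNDS` macro are made-up names, and `compile_fn` stands in for whatever wraps the OpenCL program build. Only `OZAKI_CUTOFF` is taken from the text.

```python
class KernelRegistry:
    """Caches one compiled kernel per (cutoff, bounds_check) specialization."""

    def __init__(self, compile_fn):
        self._compile = compile_fn  # hypothetical wrapper around the OpenCL build
        self._kernels = {}
        self.compile_count = 0

    def get(self, cutoff, bounds_check):
        key = (cutoff, bool(bounds_check))
        if key not in self._kernels:
            # The cutoff is passed as a compile-time constant so the compiler
            # can drop slice pairs that can never execute.
            options = f"-DOZAKI_CUTOFF={cutoff} -DOZAKI_BOUNDS={int(key[1])}"
            self._kernels[key] = self._compile(options)
            self.compile_count += 1
        return self._kernels[key]
```

Only the first `get` for a given key pays the compilation cost. Later calls with the same cutoff and bounds-checking mode return the cached kernel.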

Example

Default settings (Scheme 2) with M=N=K=256:

./ozaki.x 256

Scheme 2 on a large matrix:

OZAKI=2 ./ozaki.x 4096

Adaptive scheme selection with caching:

OZAKI=3 OZAKI_CACHE=3 ./ozaki.x 4096

Quick Tuning Guide

Scheme 2 (CRT, OZAKI=2, default): fixed cost of P integer GEMMs plus hierarchical Garner reconstruction. Predictable performance regardless of data distribution. Use OZAKI_GROUPS for K-grouping at large sizes. The hierarchical CRT (OZAKI_HIER, on by default) halves private residue arrays and enables GRF128 for doubled thread occupancy.

Scheme 1 (mantissa slicing, OZAKI=1): up to S*(S+1)/2 integer GEMMs, but adaptive cutoff can reduce this substantially for narrow exponent spans. Use OZAKI_TRIM to trade accuracy for speed.

Adaptive (OZAKI=3): automatically picks the cheaper scheme per call based on preprocessing occupancy. Best with OZAKI_CACHE=3 to avoid repeated occupancy readbacks.

Enable OZAKI_CACHE=3 when A or B stays constant across calls. Enable OZAKI_DEVPOOL=1 for repeated calls with similar sizes.
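
For readers unfamiliar with the CRT path, the fixed cost of Scheme 2 (one integer GEMM per prime, followed by Garner reconstruction) can be pictured with the integer-only sketch below. It is a simplification under stated assumptions: the moduli are arbitrary primes below 256 picked for illustration, the inputs are already integers (the real scheme first scales floating-point data), and the reconstruction is a plain single-level Garner pass, not the hierarchical variant. It needs Python 3.8 or later for `pow(x, -1, m)`.

```python
MODULI = [251, 241, 239, 233]  # hypothetical choice: pairwise coprime, each fits a u8


def garner(residues, moduli):
    """Mixed-radix (Garner) reconstruction; returns (x, product of moduli)."""
    x, prod = 0, 1
    for r, m in zip(residues, moduli):
        # Pick digit t so that x + prod * t is congruent to r modulo m.
        t = ((r - x) * pow(prod, -1, m)) % m
        x += prod * t
        prod *= m
    return x, prod


def crt_dot(a, b, moduli=MODULI):
    """Integer dot product computed from one small-residue dot product per modulus."""
    residues = []
    for m in moduli:
        ra = [v % m for v in a]  # residues in [0, m), u8 range
        rb = [v % m for v in b]
        residues.append(sum(p * q for p, q in zip(ra, rb)) % m)
    x, prod = garner(residues, moduli)
    # Map from [0, prod) to the symmetric range to recover negative results.
    return x - prod if x > prod // 2 else x
```

The result is exact as long as its magnitude stays below half the product of the moduli, which is why more primes are needed for fp64 than for fp32. For example, `crt_dot([3, -7, 12], [-5, 9, 4])` returns -30.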