Ozaki Scheme -- OpenCL

High-precision GEMM on OpenCL devices via mantissa slicing (Scheme 1) or the Chinese Remainder Theorem (CRT, Scheme 2). Both schemes decompose floating-point matrices into int8/u8 tiles and use DPAS/XMX matrix engines when available. This is an OpenCL adaptation of the CPU-based Ozaki sample in LIBXS.

Build

cd samples/ozaki
make [GNU=1] [DBG=1]

Requires an OpenCL runtime and headers. BLAS is linked via BLAS=2 for the reference GEMM.

Run

./ozaki.x [M [N [K [transa [transb [alpha [beta [lda [ldb [ldc]]]]]]]]]]

All arguments are positional and optional:

Pos.	Argument	Default	Description
1	M	257	Rows of C and op(A)
2	N	M	Columns of C and op(B)
3	K	M	Inner dimension
4	transa	0	0=N, 1=T for A
5	transb	0	0=N, 1=T for B
6	alpha	1	Scalar multiplier for A*B
7	beta	1	Scalar multiplier for C
8	lda	auto	Leading dimension of A
9	ldb	auto	Leading dimension of B
10	ldc	M	Leading dimension of C
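
The defaulting rules in the table can be summarized in a few lines. This is only an illustrative sketch in Python, not the sample's actual argument parser: the function name `resolve_args` is hypothetical, and `None` stands in for the "auto" leading dimensions, whose exact rule is not stated here.

```python
def resolve_args(argv):
    """Resolve positional arguments using the defaults from the table above.

    Hypothetical helper for illustration; lda/ldb stay None ("auto").
    """
    names = ["M", "N", "K", "transa", "transb", "alpha", "beta", "lda", "ldb", "ldc"]
    given = dict(zip(names, argv))  # positional: earlier names are filled first
    m = int(given.get("M", 257))
    return {
        "M": m,
        "N": int(given.get("N", m)),    # N defaults to M
        "K": int(given.get("K", m)),    # K defaults to M
        "transa": int(given.get("transa", 0)),
        "transb": int(given.get("transb", 0)),
        "alpha": float(given.get("alpha", 1)),
        "beta": float(given.get("beta", 1)),
        "lda": int(given["lda"]) if "lda" in given else None,  # auto
        "ldb": int(given["ldb"]) if "ldb" in given else None,  # auto
        "ldc": int(given.get("ldc", m)),  # ldc defaults to M
    }
```

For instance, passing only `512 128` leaves K and ldc at 512 while N becomes 128.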

Environment Variables

Scheme Selection

Variable	Default	Description
OZAKI	2	1=mantissa slicing, 2=CRT (default), 3=adaptive, 0=bypass (plain BLAS)
OZAKI_FP	64	64=fp64 (double), 32=fp32 (float)
OZAKI_N	(auto)	Slices (Sch.1: fp64=8, fp32=4) or primes (Sch.2: fp64=16, fp32=9)

OZAKI=3 (adaptive) starts with Scheme 1 on the first call to learn the effective cutoff from preprocessing occupancy data. Subsequent calls compare the Scheme-1 pair count against the Scheme-2 prime count and pick the cheaper path. The cutoff is cached alongside the preprocessed buffers and reused on cache hits without any device-to-host readback.
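
The comparison between pair count and prime count can be sketched as follows. This is a simplified model in Python, not the sample's code: it assumes a slice pair (i, j) is kept when i + j <= cutoff, assumes ties go to Scheme 2, and uses hypothetical function names. The counts of 8 slices and 16 primes are the fp64 defaults of OZAKI_N.

```python
def scheme1_pairs(nslices, cutoff):
    """Count slice pairs (i, j) kept under the assumed rule i + j <= cutoff."""
    return sum(1 for i in range(nslices) for j in range(nslices) if i + j <= cutoff)


def pick_scheme(nslices, cutoff, nprimes):
    """Return 1 if Scheme 1 needs fewer integer GEMMs than Scheme 2, else 2.

    Assumption: one integer GEMM per slice pair (Scheme 1) or per prime (Scheme 2).
    """
    return 1 if scheme1_pairs(nslices, cutoff) < nprimes else 2


# fp64 defaults: 8 slices vs. 16 primes.
full = pick_scheme(8, 7, 16)     # full triangle: 36 pairs, so Scheme 2 is cheaper
narrow = pick_scheme(8, 4, 16)   # reduced cutoff: 15 pairs, so Scheme 1 is cheaper
```

Under this model, data with a narrow exponent span (small effective cutoff) tips the choice toward Scheme 1.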

Accuracy

Variable	Default	Description
OZAKI_FLAGS	3	Sch.1 bitmask: 1=Triangular, 2=Symmetrize, 0=full S^2. Not used by Sch.2
OZAKI_TRIM	0	Precision levels to trim (0=exact). Each level is ~7 bits (Sch.1) or ~4 bits (Sch.2)
OZAKI_I8	0	Sch.2: use signed i8 residues (moduli<=128) instead of u8
OZAKI_GROUPS	0	Sch.2: K-grouping factor, consecutive K panels share reconstruction

Hardware Control

Variable	Default	Description
OZAKI_RTM	(auto)	Register tiling M (power of two). Auto: 2 (HIER), 4 (256-GRF)
OZAKI_RTN	(auto)	Register tiling N (power of two). Auto: 2 (Intel GPU), 1 (other)
OZAKI_WG	0	Work-group size hint (0=no hint)
OZAKI_SG	(auto)	Sub-group size (forced to 16 with XMX)
OZAKI_BIGGRF	(auto)	Override 256-GRF detection (0=off, 1=on). HIER defaults to 128
OZAKI_KU	2	K-loop unroll factor
OZAKI_RC	8	DPAS repeat count (8 or 4)
OZAKI_PB	1	Sch.2: CRT prime batching factor
OZAKI_HIER	(auto)	Sch.2: hierarchical CRT (default on). Two-level Garner reconstruction
OZAKI_PREFETCH	0	Sch.1: enable prefetching
OZAKI_SCALAR_ACC	0	Sch.1: force scalar accumulation

Memory and Caching

Variable	Default	Description
OZAKI_DEVPOOL	0	Device memory pool via USM/SVM (eliminates per-call alloc overhead)
OZAKI_CACHE	0	Preprocessing cache bitmask: 1=A, 2=B, 3=both. Skips on match

The preprocessing cache also stores the last effective cutoff from Scheme 1 occupancy detection. On cache hits the cutoff is reused without device-to-host readback, eliminating the sync bubble.
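
The bitmask and the cached cutoff can be pictured with a small host-side model. This is a sketch in Python under stated assumptions: the class `PreprocessCache`, its `key` argument (what the real cache compares to detect a match is not described here), and the payload layout are all hypothetical.

```python
CACHE_A, CACHE_B = 1, 2  # bit values from the OZAKI_CACHE table


class PreprocessCache:
    """Toy model: one cached entry per operand, enabled by a bitmask."""

    def __init__(self, mask):
        self.mask = mask
        self.entries = {}  # bit -> (key, payload)
        self.misses = 0

    def get(self, bit, key, preprocess):
        """Return the cached payload on a match, otherwise run preprocess()."""
        if self.mask & bit:
            hit = self.entries.get(bit)
            if hit is not None and hit[0] == key:
                return hit[1]  # hit: cutoff travels with the payload, no readback
        self.misses += 1
        payload = preprocess()
        if self.mask & bit:
            self.entries[bit] = (key, payload)
        return payload


cache = PreprocessCache(mask=CACHE_A)  # corresponds to OZAKI_CACHE=1
first = cache.get(CACHE_A, "A0", lambda: {"slices": "...", "cutoff": 5})
again = cache.get(CACHE_A, "A0", lambda: {"slices": "...", "cutoff": 5})
```

In this model the second lookup for A returns the stored payload, including the cutoff, while B is preprocessed on every call because its bit is not set.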

Benchmark

Variable	Default	Description
NREPEAT	1	Number of benchmark repetitions
OZAKI_VERBOSE	0	0=silent, 1=errors, 2=warnings, 3+=all, negative=all

Additional variables for profiling, accuracy monitoring, and complex GEMM dispatch (OZAKI_PROFILE, OZAKI_THRESHOLD, OZAKI_STAT, OZAKI_EPS, OZAKI_RSQ, OZAKI_EXIT, OZAKI_COMPLEX) are handled by the LIBXS Ozaki sample, which owns the GEMM interceptor. See its README for details.

Kernel Registry

Scheme 1 fused GEMM kernels are compiled on demand via a JIT registry. The compile-time cutoff (OZAKI_CUTOFF) is baked into each kernel specialization, allowing the compiler to eliminate dead slice-pair iterations and reduce register pressure. The first call with a given cutoff value triggers JIT compilation (~100 ms); subsequent calls hit the registry cache. Typical workloads produce 2-3 specializations (full cutoff, reduced cutoff, each with/without bounds checking).
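
The compile-on-demand behavior amounts to memoization keyed by the specialization parameters. The following Python sketch is a host-side model only: `KernelRegistry`, `fake_compile`, and the `BOUNDS_CHECK` define are hypothetical stand-ins, and the returned string merely represents a compiled kernel.

```python
class KernelRegistry:
    """Toy JIT registry: compile once per (cutoff, bounds_check) key."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._kernels = {}
        self.compilations = 0

    def get(self, cutoff, bounds_check):
        key = (cutoff, bool(bounds_check))
        if key not in self._kernels:
            # First use of this specialization: pay the JIT cost once.
            self._kernels[key] = self._compile(cutoff, bounds_check)
            self.compilations += 1
        return self._kernels[key]


def fake_compile(cutoff, bounds_check):
    """Stand-in for a real build; returns the build options it would use."""
    return f"-DOZAKI_CUTOFF={cutoff} -DBOUNDS_CHECK={int(bounds_check)}"


registry = KernelRegistry(fake_compile)
k1 = registry.get(7, False)  # compiles
k2 = registry.get(7, False)  # registry hit
k3 = registry.get(4, True)   # new specialization, compiles
```

Because the cutoff is part of the key, a workload that alternates between a full and a reduced cutoff settles on a handful of cached specializations.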

Example

Default settings on a 256x256x256 problem (N and K default to M):

./ozaki.x 256

Scheme 2 on a large matrix:

OZAKI=2 ./ozaki.x 4096

Adaptive scheme selection with caching:

OZAKI=3 OZAKI_CACHE=3 ./ozaki.x 4096

Quick Tuning Guide

Scheme 2 (CRT, OZAKI=2, default): fixed cost of P integer GEMMs plus hierarchical Garner reconstruction. Predictable performance regardless of data distribution. Use OZAKI_GROUPS for K-grouping at large sizes. The hierarchical CRT (OZAKI_HIER, on by default) halves private residue arrays and enables GRF128 for doubled thread occupancy.
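
The idea behind the CRT path (P independent modular GEMMs followed by Garner reconstruction) can be shown on small integer matrices. This Python sketch is a simplified model, not the OpenCL kernels: the four moduli are assumed values that fit in u8, the reconstruction is plain single-level Garner rather than the hierarchical variant, and floating-point scaling is left out.

```python
from math import prod

# Assumed pairwise-coprime moduli that fit in u8; the actual prime set may differ.
MODULI = [251, 241, 239, 233]


def matmul_mod(a, b, m):
    """Integer GEMM with all operands and results reduced modulo m."""
    return [[sum((a[i][l] % m) * (b[l][j] % m) for l in range(len(b))) % m
             for j in range(len(b[0]))] for i in range(len(a))]


def garner(residues, moduli):
    """Mixed-radix (Garner) reconstruction of x modulo prod(moduli)."""
    x, radix = 0, 1
    for r, m in zip(residues, moduli):
        # Choose digit t so that x + radix * t is congruent to r modulo m.
        t = ((r - x) * pow(radix, -1, m)) % m
        x += radix * t
        radix *= m
    return x


def crt_matmul(a, b, moduli=MODULI):
    """Exact integer product, valid while |c_ij| < prod(moduli) / 2."""
    per_mod = [matmul_mod(a, b, m) for m in moduli]  # one GEMM per modulus
    big = prod(moduli)
    out = []
    for i in range(len(a)):
        row = []
        for j in range(len(b[0])):
            x = garner([c[i][j] for c in per_mod], moduli)
            row.append(x - big if x > big // 2 else x)  # centered lift for negatives
        out.append(row)
    return out


c = crt_matmul([[1, -2], [3, 4]], [[5, 6], [-7, 8]])
```

The number of modular GEMMs depends only on the number of moduli, not on the matrix values, which is why the cost of this scheme is predictable.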

Scheme 1 (mantissa slicing, OZAKI=1): up to S*(S+1)/2 integer GEMMs, but adaptive cutoff can reduce this substantially for narrow exponent spans. Use OZAKI_TRIM to trade accuracy for speed.
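
Mantissa slicing itself can be illustrated on a single scalar. This Python sketch assumes 7-bit signed slices (in line with the ~7 bits per level quoted for OZAKI_TRIM) and hypothetical helper names; the real kernels operate on whole tiles with shared exponents, which is omitted here.

```python
import math

BITS = 7  # assumed payload bits per signed int8 slice


def slice_mantissa(x, nslices=8, bits=BITS):
    """Split x into `nslices` signed digits plus an exponent.

    x is approximated by sum(d[k] * 2**(e - bits * (k + 1))).
    """
    if x == 0.0:
        return [0] * nslices, 0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    digits, r = [], m
    for _ in range(nslices):
        r = math.ldexp(r, bits)  # move the next `bits` bits above the point
        d = int(r)               # truncate toward zero, so |d| < 2**bits
        digits.append(d)
        r -= d                   # keep the remaining fraction (exact)
    return digits, e


def rebuild(digits, e, bits=BITS):
    """Sum the slices back into a float."""
    return sum(math.ldexp(d, e - bits * (k + 1)) for k, d in enumerate(digits))


x = 3.141592653589793
d8, e8 = slice_mantissa(x, 8)  # 56 bits: covers the 53-bit fp64 mantissa
d4, e4 = slice_mantissa(x, 4)  # 28 bits: truncated approximation
```

With 8 slices the double is recovered exactly, while dropping trailing slices bounds the error by the weight of the last kept slice, which is the trade-off OZAKI_TRIM exposes.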

Adaptive (OZAKI=3): automatically picks the cheaper scheme per call based on preprocessing occupancy. Best with OZAKI_CACHE=3 to avoid repeated occupancy readbacks.

Enable OZAKI_CACHE=3 when A or B stays constant across calls. Enable OZAKI_DEVPOOL=1 for repeated calls with similar sizes.