Barranco Studio

Inside Metal 4: How Apple’s New Quantized Tensors Keep Giant AI Models on the iPhone

As state-of-the-art neural networks continue to balloon in size, the engineering bottlenecks around artificial intelligence have shifted. Training massive transformer networks remains a compute-bound struggle, but executing those models at inference time is almost entirely a battle against memory bandwidth: the GPU spends much of its time idle, waiting for model weights to travel from system memory into the execution cores.
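
To put rough numbers on that bottleneck, here is a back-of-the-envelope sketch. It assumes every generated token has to read every weight once and that nothing else costs time, so it gives a ceiling rather than a prediction. The 8B parameter count and the 100 GB/s bandwidth figure are made-up inputs for illustration, not specifications of any Apple device.

```python
# Simplified model: decode throughput is capped by memory bandwidth divided
# by the number of bytes of weights that must be streamed per token.

def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store the weights at a given precision."""
    return num_params * bits_per_weight / 8


def max_tokens_per_second(num_params: int, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when weight traffic is the only cost."""
    return bandwidth_gb_s * 1e9 / weight_bytes(num_params, bits_per_weight)


if __name__ == "__main__":
    PARAMS = 8_000_000_000   # hypothetical 8B-parameter model
    BANDWIDTH_GB_S = 100.0   # hypothetical bandwidth, not a device spec
    for name, bits in [("FP16", 16), ("INT4", 4), ("INT2", 2)]:
        gib = weight_bytes(PARAMS, bits) / 2**30
        tps = max_tokens_per_second(PARAMS, bits, BANDWIDTH_GB_S)
        print(f"{name}: {gib:.1f} GiB of weights, at most {tps:.2f} tokens/s")
```

Under this model the ceiling scales inversely with bits per weight, which is why moving from FP16 to INT4 raises it by a factor of four regardless of the bandwidth figure plugged in.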

To keep massive language models and diffusion pipelines running locally without triggering aggressive thermal throttling or blowing through device memory limits, hardware architectures must compress data down to the metal. At Apple's latest software engineering deep dive, Shiyao, a lead GPU Software Engineer, outlined Cupertino’s updated strategy for navigating this bottleneck: a complete overhaul of the Metal Shading Language (MSL) TensorOps library, unlocking multi-plane quantization frameworks and native, register-level fused attention algorithms inside macOS and iOS 27.


The Modern ML Stack on Apple Silicon

Apple’s machine learning infrastructure is explicitly layered to balance high-level developer agility with low-level hardware control. While most consumer-facing applications interact with the top tier of the ecosystem, modern optimization requires a direct understanding of how data trickles down to the silicon:

  • Domain & High-Level Frameworks: Core AI and MLX provide minimal-code environments for model deployment and PyTorch model conversion.
  • Mid-Level Acceleration: Metal Performance Shaders (MPS) and MPSGraph expose pre-compiled, highly stable GPU kernel sets.
  • Low-Level Primitives: Metal Performance Primitives and the TensorOps library interface directly with MSL shaders, automatically routing dense matrix workloads straight into the dedicated hardware neural accelerators packed inside every shader core of the M5 chip family.

Multi-Plane Tensors: Packing Scales Next to Data

Standard model compression relies heavily on quantization: reducing 16-bit half-precision (FP16) or 32-bit single-precision (FP32) floating-point weights to dense 4-bit or 2-bit integer formats. Compressing weights this aggressively requires accompanying scale factors, which map the small integers back to their original numeric range during execution.
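
As a minimal sketch of what that means in practice, the snippet below quantizes one block of weights to signed 4-bit codes with a single scale factor, then packs two codes per byte. The symmetric scheme, the low-nibble-first packing order, and all function names are assumptions chosen for illustration; the source does not specify how Apple lays out its INT4 data.

```python
def quantize_block(values, bits=4):
    """Symmetric quantization of one block: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit codes
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0    # all-zero block: any scale works
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Restore approximate weights by multiplying each code by the scale."""
    return [c * scale for c in codes]


def pack_int4(codes):
    """Pack signed 4-bit codes two per byte, low nibble first (assumed order)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: sign-extend each nibble back to a Python int."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes
```

The reconstruction error per weight is bounded by half a scale step, so the quality of the result depends directly on how well each scale fits the block it covers.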

Historically, managing these scale factors meant writing custom, error-prone address-tracking routines to pull data from separate memory buffers. In macOS and iOS 27, Apple removes this complexity by introducing native Multi-Plane Tensors: a single MTLTensor object encapsulates both the compressed data matrix (the data plane) and its scaling parameters (the scale plane).

┌────────────────────────────────────────────────────────┐
│ [Unified MTLTensor Object]                             │
│  ┌──────────────────────────────────────────────────┐  │
│  │ Data Plane: INT4 / INT2 Quantized Weights        │  │
│  └──────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────┐  │
│  │ Scale Plane: FP8 E8M0 Block-Wise Scale Factors   │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘

The scale plane supports the FP8 E8M0 block-wise scale factor format, an 8-bit exponent-only encoding in which every scale is a power of two. Rather than assigning a distinct scale to every individual weight, a single element of the scale plane covers an entire sub-block of the data plane (such as a 32×1 block), which sharply reduces memory traffic during large matrix operations.
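
The following sketch shows how such a scale plane could be decoded and what it costs in storage. It assumes the E8M0 convention from the OCP Microscaling (MX) formats, where a byte e encodes the scale 2^(e - 127) and 0xFF is reserved for NaN; the source does not spell out the encoding Apple uses, and the function names here are hypothetical.

```python
import math

E8M0_BIAS = 127     # assumption: OCP Microscaling (MX) exponent bias
E8M0_NAN = 0xFF     # assumption: reserved NaN encoding in the MX convention


def e8m0_decode(byte: int) -> float:
    """Decode an exponent-only scale byte into a power-of-two multiplier."""
    if byte == E8M0_NAN:
        return math.nan
    return 2.0 ** (byte - E8M0_BIAS)


def dequantize(codes, scale_bytes, block=32):
    """Each scale byte covers `block` consecutive codes of the data plane."""
    if len(codes) != len(scale_bytes) * block:
        raise ValueError("data plane and scale plane sizes do not match")
    return [c * e8m0_decode(scale_bytes[i // block])
            for i, c in enumerate(codes)]


def bits_per_weight(data_bits: int, block: int = 32, scale_bits: int = 8) -> float:
    """Effective storage cost once the shared scale is amortized over a block."""
    return data_bits + scale_bits / block
```

With one 8-bit scale per 32 weights, the scale plane adds 0.25 bits per weight, so INT4 data costs 4.25 bits per weight in total and INT2 data costs 2.25.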


Quantized Matrix Math in Metal Shading Language

Implementing this multi-plane architecture inside custom Metal kernels requires declaring explicit type aliases to cleanly handle the underlying block boundaries. Shaders can interact with host-allocated memory handles directly, or alternatively, construct ultra-fast tensor_inline objects right on the shader's local stack:

// Defining a multi-plane quantized tensor type configuration in MSL
using scale_plane_t = tensor_descriptor<...>;
using quantized_tensor_t = tensor<...>;

kernel void QuantizedMatMul(
    device void* weight_buffer     [[buffer(0)]],
    device void* scale_buffer      [[buffer(1)]],
    uint2 threadgroup_id           [[threadgroup_position_in_grid]])
{
    // Instantiating a temporary inline tensor directly on the shader stack
    tensor_inline weights(weight_buffer, scale_buffer);

    // Synchronously slice the data and scales planes by threadgroup position
    auto tileWeights = weights.slice(threadgroup_id.y, threadgroup_id.x);

    // TensorOps handles dequantization automatically during execution
    matmul2d_descriptor descriptor;
    matmul2d_op op(4); // Configured for 4 coordinating SIMD groups
    op.run(tileWeights, ...);
}
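
For intuition about what the synchronized slice and the automatic dequantization have to accomplish, here is a CPU-side reference sketch in plain Python. It is not the TensorOps API: the function names are hypothetical, scales are stored as already-decoded floats, and it assumes each scale covers 32 consecutive weights along a row.

```python
BLOCK = 32   # assumed: weights per scale entry along a row


def slice_planes(data, scales, row0, col0, rows, cols, block=BLOCK):
    """Return the matching (data tile, scale tile) for one threadgroup."""
    if col0 % block or cols % block:
        raise ValueError("tile must start and end on a scale-block boundary")
    data_tile = [r[col0:col0 + cols] for r in data[row0:row0 + rows]]
    scale_tile = [r[col0 // block:(col0 + cols) // block]
                  for r in scales[row0:row0 + rows]]
    return data_tile, scale_tile


def dequant_matvec(data_tile, scale_tile, x, block=BLOCK):
    """y = W_tile @ x, rescaling each integer code by its block scale on the fly."""
    out = []
    for codes, row_scales in zip(data_tile, scale_tile):
        acc = 0.0
        for j, code in enumerate(codes):
            acc += code * row_scales[j // block] * x[j]
        out.append(acc)
    return out
```

The point of slicing both planes in one call is that the column offset into the scale plane is always the data offset divided by the block size; tracking that by hand is exactly the bookkeeping a multi-plane tensor is meant to remove.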

Register Dequantization vs. Threadgroup Memory

When working with specialized or custom quantization formats that cannot leverage multi-plane configurations directly, developers frequently default to loading compressed data into Shared Local Memory (Threadgroup Memory), dequantizing it to FP16, and streaming it out to the execution blocks. However, this method introduces costly round-trip load/store penalties across the local memory bus.

The Efficiency Path: TensorOps avoids this memory tax by allowing developers to unpack custom formats directly into Cooperative Tensors. Because cooperative tensors distribute their structural contents entirely across the local register files of the participating threads, developers can execute custom dequantization routines completely in-register, feeding the resulting values straight into matrix execution engines.
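
A simple byte-counting sketch makes the difference concrete. It is a deliberately crude model under stated assumptions: it counts only the bytes each path moves for one tile, and it ignores caching, bank conflicts, and latency hiding. The function names and the 64×64 tile are illustrative choices, not measurements.

```python
def threadgroup_path_bytes(num_weights, packed_bits=4, unpacked_bits=16):
    """Read packed weights, write FP16 to threadgroup memory, read FP16 back."""
    packed = num_weights * packed_bits / 8
    unpacked = num_weights * unpacked_bits / 8
    return {
        "device_read": packed,
        "threadgroup_write": unpacked,
        "threadgroup_read": unpacked,
    }


def register_path_bytes(num_weights, packed_bits=4):
    """Read packed weights once and dequantize in registers."""
    return {
        "device_read": num_weights * packed_bits / 8,
        "threadgroup_write": 0.0,
        "threadgroup_read": 0.0,
    }


if __name__ == "__main__":
    tile = 64 * 64   # hypothetical tile of 4-bit weights
    print("threadgroup path:", sum(threadgroup_path_bytes(tile).values()), "bytes")
    print("register path:   ", sum(register_path_bytes(tile).values()), "bytes")
```

In this model the staging path moves eight extra bytes of threadgroup traffic for every byte of packed 4-bit data it loads, which is the round trip the in-register approach avoids.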

Building Advanced Operations: Fusing FlashAttention

The flexibility of cooperative tensor integration becomes critical when building advanced, mathematically complex fusion layers like FlashAttention. Calculating attention requires multiplying the Query (Q) matrix by the transposed Key (K) matrix, executing row-level SoftMax reductions across the intermediate results, and multiplying the result by a Value (V) matrix.
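
As a plain reference for what the fused kernel has to reproduce, here is an unfused attention sketch over small list-of-lists matrices. The 1/sqrt(d) scaling follows the usual scaled dot-product convention and is an assumption here, since the article does not mention it; the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable SoftMax: subtract the row maximum before exp()."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V one query row at a time."""
    d = len(Q[0])
    out = []
    for q in Q:
        # One row of the intermediate product: dot the query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out
```

Subtracting the row maximum leaves the result unchanged mathematically but keeps exp() from overflowing, which is why the row-maximum reduction comes first in any practical implementation.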

[Q Matrix] × [K Matrix]ᵀ ──► Cooperative Tensor (Intermediate Product)
                                              │
                                              ▼
reduce_rows() ──► Row Maxima ──► map_iterator() ──► In-Register SoftMax
                                                             │
                                                             ▼
                        [Left Input Cooperative Tensor] × [V Matrix]

To construct this efficiently using TensorOps, developers configure a highly isolated SIMD group mapping using the execution_simdgroup scope. This setup assigns complete rows of the intermediate matrix to individual SIMD groups, allowing them to calculate mathematical reductions completely independently without exchanging data across the broader threadgroup.

The code below demonstrates how to leverage internal iterators and native reduction operators to execute an in-register SoftMax row calculation before feeding the result directly as a left-hand input into a secondary matrix multiplication block:

// Compute row-level maximums across the intermediate cooperative tensor
cooperative_tensor intermediate_tensor;
cooperative_tensor row_maxima;

// Execute row reductions across threads via native Max operations
reduce_rows(intermediate_tensor, row_maxima, reduction_output::max, -INFINITY);

// Map and traverse the 2D matrix structure using internal iterators
auto matrix_iter = intermediate_tensor.begin();
auto matrix_end  = intermediate_tensor.end();
while (matrix_iter != matrix_end) {
    // Dynamically resolve the corresponding row maximum value
    auto max_iter = map_iterator(matrix_iter, row_maxima);

    // Compute the unnormalized SoftMax numerator exp(x - row_max) in registers;
    // dividing by the row sum is a separate step not shown in this listing
    half softmax_val = exp(*matrix_iter - *max_iter);
    matrix_iter.set_element(softmax_val);
    ++matrix_iter;
}

// Verify register layout compatibility before recycling the tensor as an input
if (matmul2d_op::is_compatible_as_left_input(intermediate_tensor)) {
    auto left_input = intermediate_tensor.get_left_input_cooperative_tensor();

    // Execute secondary matrix multiplication against the V matrix entirely on-chip
    second_matmul_op.run(left_input, tileV, output_tile);
}
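
The listing handles a single tile. When the key and value matrices are processed tile by tile, the running-maximum rescaling usually associated with FlashAttention lets the partial results be combined without ever holding a full score row. The sketch below illustrates that idea in plain Python for one query row, assuming finite scores; it is a reference for the arithmetic, not Apple's implementation, and the function name is hypothetical.

```python
import math


def streaming_attention_row(score_tiles, value_tiles):
    """Combine per-tile exp(x - max) results using a running max and running sum."""
    running_max = -math.inf
    running_sum = 0.0
    acc = None
    for scores, values in zip(score_tiles, value_tiles):
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        # On the first tile this is exp(-inf) == 0.0, which discards nothing real.
        correction = math.exp(running_max - new_max)
        # Same per-element step as the in-register loop: exp(x - row_max).
        probs = [math.exp(s - new_max) for s in scores]
        if acc is None:
            acc = [0.0] * len(values[0])
        acc = [a * correction + sum(p * v[c] for p, v in zip(probs, values))
               for c, a in enumerate(acc)]
        running_sum = running_sum * correction + sum(probs)
        running_max = new_max
    # The deferred normalization: divide once by the total row sum.
    return [a / running_sum for a in acc]
```

Splitting a row into tiles gives the same output as processing it in one piece, up to floating-point rounding, because every earlier partial sum is rescaled whenever a larger maximum appears.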

Bridging Custom Kernels into Python via Core AI

Writing a highly optimized low-level shader is meaningless if it remains isolated from modern research environments. To connect these custom Metal Shading Language layers back to machine learning researchers, Apple provides a direct integration bridge inside its Python-driven Core AI compilation stack.

Engineers can wrap their MSL code string in a TorchMetalKernel object, swap standard HuggingFace execution blocks for their specialized custom methods, and compile the entire setup into an optimized asset package. Testing this integration on a production architecture, such as a fused FlashAttention kernel deployed inside a Segment Anything 3 (Sam3) image segmentation model, shows the pipeline running end to end and returning clean object mask arrays with minimal translation latency.

By pairing deeply integrated low-bit data structures (INT4, INT2, and FP8 E8M0 scales) with register-level calculation loops, Apple’s updated TensorOps framework gives developers considerable room to optimize edge-based transformer pipelines directly on consumer-class hardware.
