## **Post-Moore's Law Fusion**

## High-Bandwidth Memory, Accelerators, and Native Half-Precision Processing for CPU-Local Analytics

Viktor Sanca, Anastasia Ailamaki

ADMS 2023



LinkedIn: viktor-sanca, email: viktor.sanca@epfl.ch





# **CPU Evolution:** The Day of Reckoning

[Moore's Law] Chiplets

Specialization [Intel, AMD, IBM, Apple, ...]

Heterogeneous + specialized unit interactions

#### G. Moore: Cramming Components Onto Integrated Circuits (1965)

It may prove to be more economical to build large systems out of smaller functions, which are separately packaged and interconnected. The availability of large functions, combined with functional design and construction, should allow the manufacturer of large systems to design and construct a considerable variety of equipment both rapidly and economically.

Section VIII: Day of Reckoning

#### Fusion and mix of CPU components: bottleneck shift and novel tradeoffs





## Large System Out of Small Functions









Intel Embedded Multi-die Interconnect Bridge

Open Chiplet Ecosystem Universal Chiplet Interconnect Express (UCIe)

Interconnected Chiplets: Increased On-Socket NUMA Granularity

# A Fusion of Components for Modern Workloads



## Complex interplay of novel memory + computation non-uniformity





# The Big Picture: Intel Sapphire Rapids

Intel Xeon 9480 MAX

**Configuration** 4 Chiplets/Tiles

**Per Tile** 14 cores (28 HT), 16 GB HBM2e, 64 GB DDR5

**Total** 56 cores (112 HT), 64 GB HBM2e, 256 GB DDR5





**NUMA** 8 regions per socket: 4 HBM + 4 CPU/DRAM
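The per-tile and per-socket figures above are related by simple multiplication, which the following minimal sketch spells out. The `Tile` class and `socket_totals` function are hypothetical names introduced only for illustration; the two hardware threads per core are inferred from the "14 cores (28HT)" figure, and the two NUMA regions per tile from the "4 HBM + 4 CPU/DRAM" split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """Per-tile resources as listed on the slide (hypothetical helper type)."""
    cores: int = 14
    threads_per_core: int = 2   # inferred from 14 cores (28 HT)
    hbm_gb: int = 16
    ddr_gb: int = 64


def socket_totals(tile: Tile, tiles: int = 4) -> dict:
    """Scale per-tile resources to a socket built from `tiles` chiplets."""
    return {
        "cores": tile.cores * tiles,
        "threads": tile.cores * tile.threads_per_core * tiles,
        "hbm_gb": tile.hbm_gb * tiles,
        "ddr_gb": tile.ddr_gb * tiles,
        # Assumption: one HBM region plus one CPU/DRAM region per tile.
        "numa_regions": 2 * tiles,
    }


totals = socket_totals(Tile())
print(totals)
```

The point of the sketch is that every resource on the socket is exposed at tile granularity, so placement decisions can be made per tile rather than per socket.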



#### Evolution: from monolithic CPU resources to fine-grained control



# Interaction of Individual Functionalities



**1. High-Bandwidth Memory** Bandwidth-Bound Workloads

**2. Native Half-Precision Types** Optimized Computation + Vectorization

**3. Specialized Hardware Accelerators** Offload CPU Cores By Specialized Units

## Evaluate the interplay of granular memory and computational decisions



# HBM vs. DRAM: Extending the Memory Hierarchy

Intel Memory Latency Checker (MLC v3.10) Bandwidth Matrix





2-3.7x bandwidth increase on socket, shifting the DRAM bottleneck



## HBM vs. DRAM + Interconnects: Latency Slowdown

Intel Memory Latency Checker (MLC v3.10) Latency Matrix





Up to 30% higher latency over DRAM for EMIB, negligible for UPI



# The Impact of HBM: Data Access Patterns

Workload: 1B FP32 elements + Tile-local processing (up to 28 threads)

Access patterns:

**SCAN:** full sequential scan (bandwidth)

**RANDM:** random access (latency)

**SEQM:** sequential scan with indirection (mix)







Evaluate extra bandwidth and higher latency on generalized patterns
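The three access patterns can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the benchmark code behind the slides: the function names are hypothetical, the array is tiny compared to the 1B elements used in the talk, and the SEQM index is assumed to be a shuffled permutation because the slides do not specify how the indirection is generated.

```python
import random


def scan(data):
    """SCAN: touch every element in storage order (bandwidth-bound)."""
    total = 0.0
    for x in data:
        total += x
    return total


def randm(data, n_accesses, rng):
    """RANDM: read elements at independent random positions (latency-bound)."""
    total = 0.0
    n = len(data)
    for _ in range(n_accesses):
        total += data[rng.randrange(n)]
    return total


def seqm(data, index):
    """SEQM: scan an index array sequentially, then dereference into data."""
    total = 0.0
    for i in index:
        total += data[i]
    return total


n = 1_000                      # toy size; the slides use 1B FP32 elements
data = [float(i) for i in range(n)]
rng = random.Random(0)         # fixed seed so runs are repeatable
index = list(range(n))
rng.shuffle(index)             # assumption: indirection is a permutation

scan_total = scan(data)
seqm_total = seqm(data, index)
randm_total = randm(data, n, rng)
```

With a permutation as the index, SCAN and SEQM visit the same elements and differ only in the order of memory accesses, which is what isolates the effect of the access pattern.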



## Data Access Pattern: Bottleneck Shift

SCAN: sequential scan, RANDM: random access, SEQM: sequential scan with indirection; 1B elements, 28 threads



HBM provides additional resources with similar scalability characteristics

## Higher HBM Latency + Random Access Improvement



HBM scales with additional resources consuming/starving for data

# Scaling The Bandwidth Wall

SUM: summing up 1B elements, up to 28 threads on a single tile, DRAM/HBM local data.



Breaking the DRAM bandwidth wall with the benefit of data + core locality
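As a rough illustration of the SUM workload, the sketch below splits an array into one contiguous chunk per thread and adds up the partial sums. `partitioned_sum` is a hypothetical helper, not the code used for the measurements. Python threads do not run this loop in parallel because of the interpreter lock, and Python offers no control over HBM or DRAM placement, so the sketch only shows the partitioning scheme.

```python
from concurrent.futures import ThreadPoolExecutor


def partitioned_sum(data, n_threads):
    """Sum `data` by giving each worker one contiguous chunk."""
    if not data:
        return 0
    # Ceiling division so that every element belongs to exactly one chunk.
    chunk = (len(data) + n_threads - 1) // n_threads
    bounds = [(lo, min(lo + chunk, len(data)))
              for lo in range(0, len(data), chunk)]

    def partial(b):
        lo, hi = b
        return sum(data[lo:hi])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(partial, bounds))


values = list(range(1_000))             # toy size; the slides use 1B elements
result = partitioned_sum(values, 28)    # 28 threads = one tile with HT
```

Contiguous chunks keep each worker's accesses sequential, which matches the bandwidth-bound nature of the workload described above.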

# Native Half-Precision Types: ML-Driven Opportunity



**1. High-Bandwidth Memory** Bandwidth-Bound Workloads and Access Patterns

**2. Native Half-Precision Types** Optimized Computation + Vectorization

**3. Specialized Hardware Accelerators** Offload CPU Cores By Specialized Units + Accelerate Workloads

## Hardware-supported types enable fine-grained memory + compute tuning

# **Reducing Transfer Size and Computation Footprint**

Workload: 1B elements SUM-IF, varying the data type and placement in HBM/DRAM



HBM + Types: benefit depends on the shifted memory + compute bottleneck
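The effect of the data type on transfer size is plain arithmetic, sketched below. The set of types (FP32, FP16, BF16), the reading of "1B" as 10^9 elements, and the less-than predicate in `sum_if` are assumptions made for illustration; the slide does not list the exact types or the filter condition.

```python
# Bytes per element for the types assumed in this sketch.
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "BF16": 2}


def footprint_gb(n_elements, dtype):
    """Data volume in GB (10^9 bytes) that a full scan has to move."""
    return n_elements * BYTES_PER_ELEMENT[dtype] / 1e9


def sum_if(values, threshold):
    """SUM-IF: add up only the values that pass a predicate (assumed: v < threshold)."""
    return sum(v for v in values if v < threshold)


n = 10**9  # assumption: "1B elements" means 10^9
footprints = {t: footprint_gb(n, t) for t in BYTES_PER_ELEMENT}
print(footprints)
```

Halving the element width halves the bytes moved per scan, so a narrower type mainly helps while the workload is still bound by memory bandwidth.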



# **On-The-Fly Intermediate Type Conversion**

Workload: 1B elements, pair-wise multiply-add, FP32->BF16 and FP16 only, DRAM + HBM



## HBM alleviates the data movement bottleneck for efficient computation
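To make the FP32->BF16 conversion concrete, here is a minimal software sketch. BF16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits of an FP32 value, so conversion amounts to rounding away the low 16 bits. The function names are hypothetical, and reading "pair-wise multiply-add" as `a[i] * b[i] + c[i]` is an assumption. On hardware the conversion is done by dedicated instructions, not bit manipulation in a scripting language.

```python
import struct


def fp32_to_bf16_bits(x):
    """Round an FP32 value to BF16 (round-to-nearest-even), returned as 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:      # NaN: keep it a quiet NaN
        return (bits >> 16) | 0x0040
    rounding = 0x7FFF + ((bits >> 16) & 1)    # ties go to the even mantissa
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_fp32(b):
    """Widen 16 BF16 bits back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def multiply_add_bf16(a, b, c):
    """Assumed semantics: a[i] * b[i] + c[i], with inputs rounded to BF16 first."""
    def r(v):
        return bf16_bits_to_fp32(fp32_to_bf16_bits(v))
    return [r(x) * r(y) + r(z) for x, y, z in zip(a, b, c)]


rounded = bf16_bits_to_fp32(fp32_to_bf16_bits(3.14))
```

Here `rounded` is 3.140625, the closest value to 3.14 that BF16 can represent, which shows the precision given up in exchange for half the data volume.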

# Accelerators: Advanced Matrix Extensions (AMX)



**1. High-Bandwidth Memory** Bandwidth-Bound Workloads and Access Patterns

**2. Native Half-Precision Types** Optimized Computation + Vectorization

**3. Specialized Hardware Accelerators** Offload CPU Cores By Specialized Units + Accelerate Workloads

# Tile Matrix Multiply (TMUL): Dot Product





8 x Tile Register Files + TMUL

Mix-and-match: specialized core-local resources added to design space

## **Use Case: Accelerating Vector Computations**

Workload: 1M tuples x 512-D vector, computing dot products against 512-D vector (on-the-fly BF16 conversion for AMX)



Offload computation from cores: complex decisions inside single socket
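The dot-product workload can be emulated in software to show how the data is blocked for a tile unit. This is a sketch under stated assumptions, not AMX code: `blocked_dot_products` and `to_bf16` are hypothetical helpers, the block shape follows the AMX tile limits (16 rows of 64 bytes, so 32 BF16 values per row), the conversion truncates where the hardware conversion instructions round to nearest even, and accumulation happens in Python doubles, not FP32.

```python
import struct

TILE_ROWS = 16   # maximum rows in one AMX tile
TILE_K = 32      # BF16 values that fit in one 64-byte tile row


def to_bf16(x):
    """Simplified FP32->BF16: drop the low 16 bits (truncation, not RNE)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def blocked_dot_products(tuples, query):
    """Dot product of every tuple with `query`, processed in tile-sized blocks."""
    dim = len(query)
    q = [to_bf16(v) for v in query]            # convert the query once
    out = []
    for r0 in range(0, len(tuples), TILE_ROWS):
        block = tuples[r0:r0 + TILE_ROWS]      # up to 16 tuples per tile
        acc = [0.0] * len(block)               # one accumulator per tuple
        for k0 in range(0, dim, TILE_K):       # 512-D -> 16 chunks of 32
            k1 = min(k0 + TILE_K, dim)
            for r, row in enumerate(block):
                acc[r] += sum(to_bf16(row[k]) * q[k] for k in range(k0, k1))
        out.extend(acc)
    return out


tuples = [[1.0] * 512, [2.0] * 512, [0.5] * 512]   # toy input; slides use 1M tuples
query = [1.0] * 512
scores = blocked_dot_products(tuples, query)
```

The blocking shows why the on-the-fly BF16 conversion matters: a tile row holds twice as many BF16 values as FP32 values, so each tile operation covers more of the 512-D vector.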





# **Growing CPU Compute and Storage Heterogeneity**

Workload: 1M tuples x 512-D vector, computing dot products against 512-D vector (on-the-fly BF16 conversion for AMX)





## **Expected** Moore's Law: Large System of Small Functions



**From Monolithic to Complex Heterogeneous CPUs** On-the-fly system adaptation for any hardware

**Complex Memory and Compute Interactions** Automating workload benchmarking and tuning [Chaosity@TPC-TC'23]

**Tailored and Optimized Data Structures and Algorithms** Using novel hardware fusion with principled design

Build adaptive and hardware-conscious systems for inevitable complexity