TeAAL and HiFiber: Precise and Concise Descriptions of (Sparse) Tensor Algebra Accelerators

Overview

This tutorial (hosted in conjunction with MICRO 2025) will show how to distill the variety we see in efficient implementations of tensor algebra kernels (in both hardware and software) into a small set of common abstractions. The tutorial will consist of a series of talks by the organizers, with references to specific code examples that participants can explore afterwards. The key learning objective is to teach participants a new language for precisely and concisely describing accelerators in media such as research papers.

Motivation

Tensor algebra workloads have exploded in popularity over the past few years, with applications ranging from deep learning to graph algorithms to physical simulations. This surge has been accompanied by a corresponding rise in proposals for custom hardware to service common kernels, e.g., matrix multiply or convolution. However, performing tensor algebra kernels efficiently can be difficult, so implementations of these kernels often look quite different from one another. Because the details of the algorithm, dataflow, tensor formats, and so on are all entangled within each design, every accelerator can seem like a one-off, exotic technique. Without a separation of concerns, it is difficult to perform apples-to-apples comparisons between existing designs or to evaluate the impact of proposed design changes.

Key Learning Objectives

Participants will learn:

As part of the tutorial, we provide an accelerator zoo—a list of recent accelerator proposals, their TeAAL specifications, and compiler invocations that automatically generate the corresponding HiFiber code from each TeAAL specification.
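To give a flavor of the kind of abstraction the tutorial builds on, the sketch below expresses the Einsum Z_mn = A_mk * B_kn over fibertree-style sparse tensors, modeled here as plain nested Python dictionaries mapping coordinates to payloads. This is an illustrative sketch only, not actual TeAAL or HiFiber syntax; all names are hypothetical and chosen for exposition.

```python
# Hedged sketch (not real HiFiber): sparse matrix multiply over
# "fibertree"-style nested dicts, where each level of the tree maps a
# coordinate to a payload (a sub-fiber or a scalar value).

def matmul(A, B):
    """A: {m: {k: val}}, B: {k: {n: val}} -> Z: {m: {n: val}}."""
    Z = {}
    for m, a_m in A.items():              # traverse rank M of A
        z_m = {}
        for k, a_val in a_m.items():      # traverse rank K of A's row
            b_k = B.get(k)                # intersect with rank K of B
            if b_k is None:
                continue                  # k absent in B: contributes nothing
            for n, b_val in b_k.items():  # traverse rank N of B's row
                z_m[n] = z_m.get(n, 0) + a_val * b_val
        if z_m:                           # keep the output sparse
            Z[m] = z_m
    return Z

# Only nonzero coordinates are stored.
A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {0: 4}, 1: {0: 5}, 2: {1: 6}}
print(matmul(A, B))  # -> {0: {0: 4, 1: 12}, 1: {0: 15}}
```

Note how the loop order (M, then K, then N), the intersection on the shared rank K, and the sparse tensor representation are each visible as separate, swappable pieces of the code—this separation of concerns is the kind of property the TeAAL and HiFiber abstractions make precise.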

Agenda

The slide decks for all talks can be found here.

Organizers

Nandeeka Nayak is a Computer Science PhD student at University of California, Berkeley, advised by Chris Fletcher. She works on understanding efficient implementations of domain-specific kernels with a focus on building abstractions that unify a wide variety of kernels and accelerator designs into a small set of primitives.

Toluwanimi O. Odemuyiwa (Toluwa) is an Electrical and Computer Engineering PhD Candidate at UC Davis, advised by John Owens. Her work focuses on exploring tensor algebra-based abstractions for graph algorithms (and other domains) in order to succinctly describe and explore the algorithmic and implementation space.

Yan Zhu is an EECS Ph.D. student at the University of California, Berkeley, advised by Chris Fletcher. Her research interests lie in domain-specific acceleration, with a particular focus on optimizing sparse computing applications. Her current work centers on accelerating applications with inherent sparsity, such as RTL simulation, by generalizing existing sparse tensor algebra analysis and optimization techniques. Before joining UC Berkeley, she received a B.A.S. in Engineering Science from the University of Toronto.

Michael Pellauer is a Principal Research Scientist at Nvidia’s Architecture Research Group (ARG). His research focuses on domain-specific hardware accelerators and how lessons learned from them can be integrated into a programmable substrate like a GPU. His current focus is on sparse tensor algebra acceleration for deep learning. He has a PhD from MIT in Computer Science, a Master of Science from Chalmers University of Technology, and a double Bachelor’s from Brown University in Computer Science and English. He previously worked at Intel Corporation’s Versatile Systems and Simulation Advanced Development (VSSAD) group as a senior architect.

Christopher W. Fletcher (Chris) is an Associate Professor in Computer Science at the University of California, Berkeley. He has broad interests spanning Computer Architecture, Security, and High-Performance Computing, ranging from theory to practice.

Joel S. Emer received B.S. (Hons.) and M.S. degrees in electrical engineering from Purdue University in 1974 and 1975, respectively, and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana–Champaign in 1979. He is a Professor of the Practice in the Electrical Engineering and Computer Science Department at MIT and a Senior Distinguished Research Scientist at NVIDIA.

Contributors

We would also like to extend a special thank you to the following people, who have contributed to this tutorial by adding accelerators to the accelerator zoo:

Resources

Tutorial Artifacts

Background Reading

MIT Accessibility Statement