Image Generated Using Adobe Firefly
Graphics Processing Units (GPUs) have become the default architecture whenever higher throughput is required. The primary reason is their massively parallel processing, enabled by many cores and the way memory is organized around them.
Because of these benefits, the AI world has also adopted GPUs as its go-to silicon architecture. The goal is to process large amounts of data in the shortest possible time; other technical reasons are reusability and portability, which lower the entry barrier for new companies developing large-scale AI solutions.
Several semiconductor companies provide GPU solutions. However, NVIDIA is winning the GPU race so far, and the main reason is the near-perfect software-to-silicon ecosystem it has created. It enables new and existing customers to adopt the latest GPU generation and new AI frameworks swiftly, all while keeping reusability and portability costs in check.
What does not work in favor of GPU architecture:
Availability: GPUs (mainly from NVIDIA) are inching towards 3 nm. There will be a race to capture the available worldwide capacity, with only one pure-play foundry capable of producing yieldable silicon at that node, and securing enough of that capacity to meet demand will take considerable effort.
Cost: GPUs will start adopting ultra-advanced (3 nm and below) nodes, and the cost of designing and manufacturing these chips will increase further. More so because GPUs have yet to move from monolithic, die-level solutions to a More-than-Moore (MtM) path. In a year or two, GPUs designed for AI workloads will likely hit the reticle limit, beyond which even EUV lithography cannot help.
Not Application-Specific: GPUs are still general-purpose with respect to application requirements. Their SIMD, MIMD, floating-point, and vector execution models fit only some workloads. AI developers (mainly large-scale software companies) will keep seeing the need for more application-specific architectures, which is why TPUs came into existence, that deliver a solution-level alternative to the GPU.
Deployment: Deploying stacked GPUs is like bringing up a massive farm, which increases the cost of operating such data centers. On top of that, the more powerful the GPUs become, the more demanding the applications become; rising data-processing demand translates into ever higher performance requirements and energy consumption.
Sooner or later, even the GPU architecture will reach a point where it may not be the first choice for AI. Currently, the software industry (or the AI industry) relies on GPUs primarily because mega data centers already use them and because they are the most broadly deployed architecture on the market.
However, as new types of AI usage and requirements arise, the software industry will realize that the GPU architecture is unsuitable for some of its applications. Thus, there is demand for more customized, performance-oriented silicon architectures.
The need for customized AI silicon architecture has already caught the eye of both the software and silicon industries, leading to more silicon-level solutions that can replace the GPU architecture or give it robust competition.
Several types of silicon architecture have the potential to replace GPUs in the near future. Below are a few:
Wafer-Scale Engine (WSE) SoCs:
The Wafer-Scale Engine (WSE) represents a paradigm shift in computing, signaling a new era in which traditional GPUs get replaced in specific applications. Unlike GPUs, which contain thousands of small processing cores, a WSE is a single, giant chip that can house hundreds of thousands of cores capable of parallel processing. This architectural leap enables a WSE to process AI and machine learning workloads more efficiently thanks to its vast on-chip memory and reduced data-transfer latency. By eliminating the bottlenecks inherent in multi-chip approaches, a WSE can deliver unprecedented performance, potentially outpacing GPUs in tasks that can leverage its massive, monolithic design. As AI and complex simulations demand ever-faster computation, WSEs could supplant GPUs in high-performance computing tasks, offering a glimpse into the future of specialized computing hardware.
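To make the data-movement argument concrete, below is a rough, hedged back-of-envelope sketch comparing the time to move a layer's activations over an off-chip link versus an on-wafer fabric. The bandwidth figures are illustrative assumptions, not vendor specifications.

```python
# Back-of-envelope comparison: moving activations off-chip (multi-chip system)
# vs. keeping them on an on-wafer fabric (wafer-scale system).
# All figures below are illustrative assumptions, not measured specifications.

activation_bytes = 4 * 1024**3     # assume ~4 GiB of activations per step

offchip_link_bw = 100e9            # assumed off-chip link: ~100 GB/s
onwafer_fabric_bw = 20e12          # assumed on-wafer fabric: ~20 TB/s

t_offchip = activation_bytes / offchip_link_bw
t_onwafer = activation_bytes / onwafer_fabric_bw

print(f"off-chip transfer : {t_offchip * 1e3:7.2f} ms")
print(f"on-wafer transfer : {t_onwafer * 1e3:7.3f} ms")
print(f"staying on-wafer is ~{t_offchip / t_onwafer:.0f}x faster for this transfer")
```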
Chiplets With RISC-V SoCs:
Chiplets utilizing the RISC-V architecture present a compelling alternative to conventional GPUs for specific computing tasks, mainly due to their modularity and customizability. RISC-V, being an open-source instruction set architecture (ISA), allows for the creation of specialized processing units tailored to specific workloads. When these processors are implemented as chiplets (small, modular silicon blocks), they can be assembled into a coherent, scalable system. The resulting system is optimized for parallel processing, similar to a GPU, but with the added advantage that each chiplet can be custom-crafted to handle a particular segment of a workload efficiently. In scenarios where energy efficiency, space constraints, and application-specific optimizations are paramount, RISC-V chiplets could feasibly replace GPUs by providing similar or superior performance while reducing power consumption and increasing processing speed, because the hardware is tailored directly to the software's needs.
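As a purely hypothetical illustration of that modularity, the sketch below routes different workload segments to chiplets tuned for them; the chiplet names and specialties are invented for this example and do not describe any real product.

```python
# Hypothetical sketch: dispatching workload segments to specialized chiplets.
# Chiplet names and specialties are invented for illustration only.
from dataclasses import dataclass

@dataclass
class Chiplet:
    name: str
    specialty: str   # the workload segment this chiplet is tuned for

# A package composed of RISC-V-based chiplets, each tuned for one segment.
package = [
    Chiplet("matmul-tile", "matmul"),        # dense linear algebra
    Chiplet("vector-tile", "activation"),    # elementwise / vector operations
    Chiplet("io-tile", "preprocessing"),     # data staging and format conversion
]

def dispatch(segment: str) -> Chiplet:
    """Return the chiplet whose specialty matches the requested segment."""
    for chiplet in package:
        if chiplet.specialty == segment:
            return chiplet
    raise ValueError(f"no chiplet tuned for segment: {segment}")

for segment in ("preprocessing", "matmul", "activation"):
    print(f"{segment:>13} -> {dispatch(segment).name}")
```

The point of the sketch is simply that, because the ISA is open, each tile can implement only what its segment of the workload needs rather than a full general-purpose pipeline.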
Tensor Processing Units SoCs:
Tensor Processing Units (TPUs), application-specific integrated circuits (ASICs) designed for machine learning tasks, offer a specialized alternative to GPUs. As system-on-chip (SoC) designs, TPUs integrate all the components needed for neural network processing onto a single chip, including memory and high-speed interconnects. Their architecture is tuned for the rapid execution of tensor operations, the heart of many AI algorithms, which enables them to process these workloads more efficiently than more general-purpose GPUs. With their ability to perform a higher number of operations per watt and their lower latency due to on-chip integration, TPUs in an SoC format can provide a more efficient solution for companies running large-scale machine learning computations, potentially replacing GPUs in data centers and AI research facilities where the speed and efficiency of neural network processing are crucial.
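As a small illustration of that tensor-operation focus, the JAX sketch below JIT-compiles a dense layer through XLA; on a machine with a TPU attached, the same code lowers to the TPU's matrix units, otherwise it falls back to GPU or CPU. This is a minimal sketch, not a benchmark.

```python
# Minimal JAX sketch: a jitted tensor operation that XLA lowers to whatever
# backend is available (TPU if present, otherwise GPU or CPU).
import jax
import jax.numpy as jnp

@jax.jit
def dense_layer(x, w, b):
    # Matrix multiply + bias + ReLU: the kind of tensor op TPUs are built around.
    return jnp.maximum(jnp.dot(x, w) + b, 0.0)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 512))
w = jax.random.normal(key, (512, 256))
b = jnp.zeros((256,))

y = dense_layer(x, w, b)
print("backend      :", jax.default_backend())   # "tpu", "gpu", or "cpu"
print("output shape :", y.shape)
```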
PIM SoCs:
Processing-in-memory (PIM) technology, particularly when embedded within a system-on-chip (SoC), is poised to disrupt the traditional GPU market by addressing the 'memory wall' problem. PIM architectures integrate processing capabilities directly into the memory chips, allowing data to be computed where it is stored and thereby reducing the time and energy spent moving data between the processor and memory. Integrating PIM with the other necessary system components as an SoC can lead to even more significant optimizations and system-level efficiency. In applications such as data analytics, neural networks, and other tasks that require rapid, parallel processing of large data sets, PIM SoCs could potentially outperform GPUs by bypassing the data-transfer bottlenecks that GPUs face, delivering faster insights and responses, especially in real-time processing scenarios.
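To see why the memory wall matters, here is a hedged estimate comparing compute time with data-movement time for a memory-bound operation (a large elementwise add). The peak-compute and bandwidth figures are assumptions for illustration, not measurements of any specific device.

```python
# Memory-wall illustration for a memory-bound operation (elementwise add).
# The peak-compute and memory-bandwidth figures are assumed, not measured.

n = 1 << 28                        # ~268M float32 elements per input vector
bytes_moved = 3 * n * 4            # read a, read b, write c (4 bytes each)
flops = n                          # one addition per element

peak_flops_per_s = 50e12           # assumed accelerator peak: 50 TFLOP/s
mem_bw_bytes_per_s = 2e12          # assumed off-chip bandwidth: 2 TB/s

t_compute = flops / peak_flops_per_s
t_memory = bytes_moved / mem_bw_bytes_per_s

print(f"compute time : {t_compute * 1e6:8.1f} us")
print(f"memory time  : {t_memory * 1e6:8.1f} us")
# Time is dominated by moving data, which is exactly the traffic PIM tries to
# eliminate by computing where the data already lives.
```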
One factor all of the above solutions need in order to succeed is a software ecosystem that AI developers can rely on. Every new solution requires a level of abstraction that makes it easy to adopt. So far, with the CUDA ecosystem and the AI frameworks optimized around CUDA, NVIDIA has aced this domain.
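What that abstraction looks like in practice is framework-level portability: the developer writes to the framework, and the framework dispatches to whichever backend is installed. A minimal sketch using JAX, assuming JAX is installed with the appropriate backend plugin:

```python
# Minimal sketch of framework-level hardware abstraction: the same program runs
# unchanged on whichever backend (CPU, GPU, or TPU) this JAX install targets.
import jax
import jax.numpy as jnp

print("backend :", jax.default_backend())
print("devices :", jax.devices())

# Place data on the default device; user code never names the silicon directly.
x = jax.device_put(jnp.ones((8, 16)))
w = jax.device_put(jnp.ones((16, 4)))

y = jnp.tanh(x @ w)
print("result shape:", y.shape)
```

Any GPU challenger needs an equivalent layer, so that adopting its silicon is a backend swap rather than an application rewrite.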
As in the CPU domain, the GPU domain cannot remain dominated by a select few forever. Soon, there will be promising SoCs that can pitch themselves as the potential future of the AI world, which will also push GPU architecture innovation to its limit.
The next five years will reveal how the “Silicon Chip For AI World” segment will evolve, but it certainly is poised for disruption.