The Future of AI Chips Might Not Be GPUs

  sonic0002        2024-06-21 22:43:00

In the layout of AI computing architectures, the model of CPUs working in collaboration with accelerator chips has become the typical AI deployment solution. CPUs provide the baseline general-purpose computing power, while accelerator chips boost computational performance so that algorithms can run efficiently. Common AI accelerator chips fall into three main types according to their technological path: GPUs, FPGAs, and ASICs.

In this competition, GPUs have emerged as the mainstream AI chip due to their unique advantages. So, how did GPUs stand out among many options? Looking ahead to the future of AI, will GPUs still be the only solution?

How Did GPUs Win the Present?

There is a close relationship between AI and GPUs.

Powerful Parallel Computing Capability

Large AI models are large-scale deep learning models that must process massive amounts of data and perform complex calculations. The core advantage of GPUs lies in their powerful parallel computing capability. Compared with traditional CPUs, GPUs can handle many tasks simultaneously, making them particularly suitable for large datasets and complex computational workloads. In fields like deep learning that require extensive parallel computation, GPUs have shown unparalleled advantages.
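
As a rough illustration of why this matters, consider matrix multiplication, the core operation inside deep learning models: every output element is an independent dot product, so thousands of GPU cores can each work on a different element at the same time. The sketch below (plain Python with NumPy, purely illustrative) makes that independence explicit.

    # Illustrative sketch: each output element C[i, j] depends only on
    # row i of A and column j of B, so all M*N elements can be computed
    # independently -- exactly the kind of work a GPU parallelizes.
    import numpy as np

    M, K, N = 64, 128, 64
    A = np.random.rand(M, K)
    B = np.random.rand(K, N)

    C = np.zeros((M, N))
    for i in range(M):                      # serial, CPU-style view
        for j in range(N):
            C[i, j] = A[i, :] @ B[:, j]     # independent of every other (i, j)

    # A GPU (or any parallel backend) computes all of these at once.
    assert np.allclose(C, A @ B)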

Comprehensive Ecosystem

To help developers make full use of GPU computing power, the major manufacturers provide rich software libraries, frameworks, and tools. For example, NVIDIA's CUDA platform offers developers a wealth of tools and libraries, making the development and deployment of AI applications relatively easy. This makes GPUs especially competitive in scenarios that require rapid iteration and adaptation to new algorithms.
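
As a concrete (and deliberately simplified) example of what this ecosystem buys developers, the sketch below uses CuPy, a NumPy-compatible array library built on CUDA. It assumes CuPy and a CUDA-capable GPU are available; it is a minimal sketch, not an official NVIDIA sample.

    # Minimal sketch: offloading a matrix multiply to the GPU via CuPy,
    # a NumPy-compatible CUDA library (assumed to be installed).
    import cupy as cp

    a = cp.random.random((4096, 4096)).astype(cp.float32)  # lives in GPU memory
    b = cp.random.random((4096, 4096)).astype(cp.float32)

    c = a @ b                           # runs as a CUDA kernel (cuBLAS underneath)
    cp.cuda.Device(0).synchronize()     # wait for the asynchronous kernel to finish

    print(cp.asnumpy(c.sum()))          # copy the result back to the host

The point is not the specific calls but that the CUDA software stack lets GPU code look and feel like ordinary array programming, which is a large part of why iteration on new algorithms is fast.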

Versatility

GPUs were initially used for graphics rendering, but over time their application areas have gradually expanded. Today, GPUs not only play a core role in graphics processing but are also widely used in deep learning, big data analysis, and other fields. This versatility allows GPUs to meet a wide variety of application needs, whereas specialized chips like ASICs and FPGAs are limited to more specific scenarios.

Some compare GPUs to a versatile multi-functional kitchen tool, suitable for various cooking needs. Therefore, in most AI applications, GPUs are considered the best choice. However, this versatility often comes with the downside of not being "refined" enough for specific fields.

Next, let’s look at the challenges GPUs face compared to other types of accelerator chips.

GPUs Have Their Limitations Too

At the beginning of this discussion, we mentioned that common AI accelerator chips can be categorized into three main types based on their technological paths: GPU, FPGA, and ASIC.

FPGA (Field Programmable Gate Array) is a semi-custom chip that users can reprogram to suit their needs. FPGAs address the inflexibility of fully custom circuits while overcoming the limited gate counts of earlier programmable devices: the chip's hardware can be flexibly reconfigured, and power consumption is lower than that of CPUs and GPUs. The downsides are that hardware description languages are harder to master, the development threshold is high, and the chips themselves are more expensive. For well-matched workloads, the customizable structure can also make FPGAs faster than CPUs and GPUs.

ASIC (Application Specific Integrated Circuit) is an integrated circuit designed and manufactured for a specific purpose defined by product requirements. ASICs are more highly customized than GPUs and FPGAs, and for their target workload they generally deliver higher performance. However, the initial investment is large, and the specialization reduces their generality: once the algorithm changes, performance drops sharply and a redesign becomes necessary.

Now, let’s examine the disadvantages of GPUs compared to these two types of chips.

First, the theoretical performance per unit cost of GPUs is lower than that of FPGAs and ASICs

From a cost perspective, moving from GPUs to FPGAs to ASICs, generality decreases and the degree of customization increases; design and development costs rise accordingly, but so does the theoretical performance per unit cost. For example, GPUs are well suited to software-level exploration of classical algorithms or deep learning algorithms that are still at the laboratory stage; for techniques that are gradually becoming standard, FPGAs are a good fit for hardware-accelerated deployment; and for computing tasks that have become fully standardized, dedicated ASIC chips can be introduced directly.

From a company's perspective, for large-scale data computing tasks, the deployment costs of mature GPUs and FPGAs with comparable memory size and computing power are similar. If a company's business logic changes frequently, say every 1-2 years, the low development cost and fast deployment of GPUs are an advantage. If the business changes only every 5 years or so, then although FPGA development costs are high, the chips themselves are much cheaper than GPUs.
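
A back-of-the-envelope comparison makes this trade-off concrete. The numbers below are hypothetical placeholders, not vendor pricing; the point is only that an FPGA project front-loads its cost in engineering (NRE, non-recurring engineering) while a GPU project front-loads it in hardware.

    # Hypothetical total-cost sketch (all figures invented for illustration;
    # real NRE, unit prices, and volumes vary widely by project).
    def total_cost(nre, unit_price, units):
        return nre + unit_price * units

    UNITS = 1000  # accelerators deployed

    gpu_cost  = total_cost(nre=0.5e6, unit_price=25_000, units=UNITS)  # low NRE, pricey chips
    fpga_cost = total_cost(nre=5.0e6, unit_price=8_000,  units=UNITS)  # high NRE, cheaper chips

    print(f"GPU fleet:  ${gpu_cost / 1e6:.1f}M")   # 25.5M
    print(f"FPGA fleet: ${fpga_cost / 1e6:.1f}M")  # 13.0M

    # If the workload is redesigned every 1-2 years, the FPGA NRE must be paid
    # again each time; if it stays stable for ~5 years, the lower unit price
    # has time to pay that NRE back.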

Second, the computational speed of GPUs is inferior to that of FPGAs and ASICs

All three—FPGA, ASIC, and GPU—contain large numbers of computing units and provide strong computational capability; when performing neural network computations, all of them are much faster than CPUs. However, a GPU's architecture is fixed, so the instructions its hardware supports are also fixed. FPGAs, by contrast, can be reprogrammed at the circuit level, and an ASIC's circuits are designed around a specific algorithm from the outset, so both can be tailored to the workload and gain significant speed advantages in many applications.

Specifically, GPUs excel at floating-point operations, making them suitable for high-precision neural network computations. FPGAs, while less adept at floating-point arithmetic, excel at pipelined processing of network packets and video streams. The achievable performance of an ASIC is bounded mainly by its design budget and the skill of its hardware designers, since the entire circuit can be optimized for the target workload.

Third, GPUs have significantly higher power consumption compared to FPGAs and ASICs

GPUs are notoriously power-hungry, with a single card drawing 250W or even 450W (e.g., the RTX 4090). FPGAs, by contrast, typically consume only 30-50W, and much of the gap comes from memory access. GPU memory interfaces (GDDR5, HBM, HBM2) offer far higher bandwidth, roughly 4-5 times that of the traditional DDR interfaces used with FPGAs, but reading DRAM consumes over 100 times more energy than reading on-chip SRAM, so the GPU's frequent DRAM accesses drive up its power consumption. In addition, FPGAs run at lower clock frequencies (below 500MHz) than CPUs and GPUs (1-3GHz), which further keeps their power draw down.
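
To see why memory traffic dominates, it helps to run the arithmetic behind that "over 100 times" figure. The per-byte energies below are illustrative order-of-magnitude values, not measurements of any particular GPU or FPGA.

    # Rough arithmetic behind "reading DRAM costs ~100x more energy than SRAM".
    # Energy-per-byte values are illustrative order-of-magnitude assumptions.
    SRAM_PJ_PER_BYTE = 1.0     # on-chip SRAM, ~1 pJ/byte (assumed)
    DRAM_PJ_PER_BYTE = 100.0   # off-chip DRAM/HBM, ~100 pJ/byte (assumed)

    bytes_per_second = 500e9   # streaming 500 GB/s of weights and activations

    sram_watts = bytes_per_second * SRAM_PJ_PER_BYTE * 1e-12
    dram_watts = bytes_per_second * DRAM_PJ_PER_BYTE * 1e-12

    print(f"500 GB/s served from on-chip SRAM: ~{sram_watts:.1f} W")   # ~0.5 W
    print(f"500 GB/s served from DRAM/HBM:     ~{dram_watts:.1f} W")   # ~50 W

    # Keeping data in on-chip memory, as many FPGA and ASIC pipelines do,
    # is a large part of how they undercut GPU power consumption.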

ASICs, designed for specific applications, offer better performance and power optimization for particular tasks, making them more efficient. Their targeted functionality often leads to higher execution efficiency and energy efficiency compared to FPGAs.

For instance, in fields like autonomous driving, deep learning applications for environmental perception and object recognition require fast computational response and low power consumption to avoid negatively impacting the vehicle’s range.

Fourth, GPUs have higher latency compared to FPGAs and ASICs

FPGAs have lower latency compared to GPUs. GPUs typically need to divide training samples into fixed-size batches to maximize parallelism, processing several batches simultaneously. In contrast, FPGA architectures are batch-free, allowing immediate output upon processing each data packet, providing a latency advantage. ASICs also achieve extremely low latency, as their task-specific optimization eliminates additional programming and configuration overheads present in FPGAs.
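
A toy latency model illustrates the mechanism. All of the parameters below are arbitrary assumptions chosen only to show the effect: a batched accelerator must wait for the batch to fill before it can start, while a streaming pipeline handles each request as it arrives.

    # Toy latency model (all parameters are illustrative assumptions).
    ARRIVAL_INTERVAL_MS = 2.0   # a new request arrives every 2 ms
    BATCH_SIZE          = 32    # GPU-style batching
    BATCH_COMPUTE_MS    = 5.0   # time to process one full batch
    STREAM_COMPUTE_MS   = 1.0   # per-request time in a streaming pipeline

    # Batched: the first request waits for the rest of the batch to arrive.
    fill_time_ms = (BATCH_SIZE - 1) * ARRIVAL_INTERVAL_MS
    batched_worst_case_ms = fill_time_ms + BATCH_COMPUTE_MS

    # Streaming (FPGA/ASIC-style): each request is processed immediately.
    streaming_ms = STREAM_COMPUTE_MS

    print(f"Batched worst-case latency:    {batched_worst_case_ms:.1f} ms")  # 67.0 ms
    print(f"Streaming per-request latency: {streaming_ms:.1f} ms")           # 1.0 ms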

Given these limitations, why do GPUs remain the hot favorite for AI computation?

In the current market environment, the cost and power consumption requirements of major manufacturers have not reached stringent levels. NVIDIA's long-term investment and accumulation in the GPU field have made GPUs the most suitable hardware for large model applications. Although FPGAs and ASICs theoretically have potential advantages, their complex development processes present challenges that hinder widespread adoption. As a result, many manufacturers choose GPUs as their solution, leading to the emergence of a fifth potential issue.

Fifth, the production capacity of high-end GPUs is also a concern

OpenAI's chief scientist Ilya Sutskever has compared GPUs to the bitcoin of a new era. With demand for computing power surging, NVIDIA's B-series and H-series GPUs have become "hard currency."

However, despite the high demand, factors such as the tight supply of HBM and CoWoS and the strained advanced production capacity at TSMC mean that GPU production cannot keep pace with demand.

In such a scenario, technology giants need to be more flexible in responding to market changes, either by stockpiling more GPU products or seeking alternative solutions.

Many manufacturers have already started exploring and developing more specialized and refined computing equipment and solutions beyond GPUs. So, how will AI accelerator chips develop in the future?

Tech Giants Blaze New Trails

In today's rapidly evolving technological landscape, where algorithms are updated monthly and data volumes are vast, GPUs indeed cater to a broader audience. However, once future business needs stabilize, FPGAs and even ASICs might emerge as the superior foundational computing devices.

Major chip manufacturers and tech giants have long been developing and producing chips specialized for deep learning and DNN computation, or semi-custom chips based on FPGA architectures. Notable products include Google's Tensor Processing Unit (TPU) and Intel's Altera Stratix V FPGA.

Google Bets on Custom ASIC Chips: TPU

As early as 2013, Google began secretly developing AI-focused machine learning algorithm chips for cloud computing data centers, aiming to replace NVIDIA GPUs. This self-developed TPU chip, publicly announced in 2016, is designed for large-scale matrix operations in deep learning models such as natural language processing, computer vision, and recommendation system models. In fact, Google had built its AI chip TPU v4 for data centers as early as 2020, but details were only disclosed in April 2023.

Notably, TPU is a customized ASIC chip designed from the ground up by Google specifically for machine learning workloads.

On December 6, 2023, Google announced its new multimodal large model, Gemini, available in three versions. According to Google's benchmarks, the Gemini Ultra version demonstrated "state-of-the-art performance" on many tests, often outperforming OpenAI's GPT-4. Alongside Gemini, Google unveiled another significant innovation: the new self-developed TPU v5p, the most powerful TPU to date. According to official figures, each TPU v5p pod uses a 3D torus topology to combine 8,960 chips over Google's highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps per chip, with FLOPS and high-bandwidth memory (HBM) improving by 2x and 3x respectively compared with TPU v4.
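
Taking those pod figures at face value, a quick calculation gives a sense of the aggregate interconnect scale. The sketch below simply multiplies the per-chip numbers quoted above; it says nothing about effective or bisection bandwidth.

    # Aggregate ICI bandwidth of one TPU v5p pod, using the figures above.
    CHIPS_PER_POD = 8960
    GBPS_PER_CHIP = 4800            # per-chip inter-chip interconnect bandwidth

    total_gbps = CHIPS_PER_POD * GBPS_PER_CHIP
    total_tb_per_s = total_gbps / 8 / 1000   # Gbps -> GB/s -> TB/s

    print(f"Aggregate ICI bandwidth: {total_gbps:,} Gbps "
          f"(~{total_tb_per_s:,.0f} TB/s across the pod)")   # ~43 million Gbps, ~5,376 TB/s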

In May this year, Google announced its sixth-generation data center AI chip, the Tensor Processing Unit Trillium, with delivery planned for later this year. Google stated that Trillium's computing performance is 4.7 times that of the TPU v5e, with an energy efficiency 67% higher than v5e. The chip is designed to power the technologies that generate text and other content from large models, and it will be available to Google's cloud customers by the end of the year.

Reportedly, NVIDIA holds approximately 80% of the AI chip market share, with most of the remaining 20% dominated by various versions of Google’s TPU. Google does not sell chips but provides access through its cloud computing platform.

Microsoft: Launches the Arm-Based General-Purpose Chip Cobalt and the ASIC Chip Maia 100

In November 2023, Microsoft unveiled its first self-developed AI chip, Azure Maia 100, at the Ignite conference, along with the Azure Cobalt chip for cloud software services. Both chips will be manufactured by TSMC using 5nm process technology.

According to reports, Nvidia's high-end products can sometimes sell for $30,000 to $40,000 each, and chips used for models like ChatGPT might require around 10,000 units, posing significant costs for AI companies. Large tech firms with substantial demand for AI chips are actively seeking alternative supply sources. Microsoft's decision to develop these chips internally aims to enhance the performance of AI products like ChatGPT while reducing costs.
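
The arithmetic behind that concern is simple; the sketch below just multiplies the figures quoted above (actual procurement prices and volumes are not public).

    # Rough fleet cost using the figures quoted above (illustrative only).
    PRICE_LOW_USD, PRICE_HIGH_USD = 30_000, 40_000   # per high-end GPU
    UNITS = 10_000                                   # GPUs reportedly needed for a ChatGPT-class model

    low  = PRICE_LOW_USD  * UNITS / 1e6
    high = PRICE_HIGH_USD * UNITS / 1e6
    print(f"Estimated hardware cost: ${low:,.0f}M to ${high:,.0f}M")
    # Roughly $300M-$400M in accelerators alone, before power, networking,
    # and data-center construction costs.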

Cobalt is a general-purpose chip based on the Arm architecture with 128 cores, while Maia 100 is an ASIC chip designed specifically for Azure cloud services and AI workloads, with 105 billion transistors. These chips will be deployed in Microsoft Azure data centers to support services such as OpenAI and Copilot.

Rani Borkar, Vice President of Microsoft's Azure chip division, stated that Microsoft has already started testing Maia 100 chips with Bing and Office AI products, and its primary AI partner and ChatGPT developer, OpenAI, is also conducting tests. Market analysts believe Microsoft's timing for AI chip development is opportune, coinciding with the ascent of large language models nurtured by companies like Microsoft and OpenAI.

However, Microsoft does not believe its AI chips can broadly replace Nvidia's products. Analysts suggest that if Microsoft's efforts succeed, it could potentially strengthen its position in future negotiations with Nvidia.

It is reported that Microsoft is expected to unveil a series of new developments in cloud hardware and software technologies at the upcoming Build conference. Of particular interest is Microsoft's plan to open access to its self-developed AI chip Cobalt 100 to Azure users.

Intel's Bet on FPGA Chips

Intel has highlighted the role of FPGA chips in early AI workloads, such as image recognition, which heavily rely on parallel performance. GPUs excel in parallel processing, making them common in machine learning and deep learning applications originally designed for video and graphics. GPUs offer exceptional parallel execution for repetitive tasks, achieving incredible speed improvements.

However, GPUs have limitations for AI tasks compared to ASICs, which are custom-built chips specifically tailored for deep learning workloads.

FPGAs, on the other hand, can be programmed to deliver GPU- or ASIC-like acceleration while retaining hardware-level customization. Their reprogrammable and reconfigurable nature makes them particularly suitable for the rapidly evolving field of AI, enabling designers to test algorithms quickly and bring products to market faster.

Intel's FPGA family includes products like Intel Cyclone 10 GX FPGA, Intel Arria 10 GX FPGA, and Intel Stratix 10 GX FPGA. These products offer advantages in I/O flexibility, low power consumption (or energy consumption per inference), and low latency, enhancing AI inference performance. Intel has recently introduced three new families: Intel Stratix 10 NX FPGA, new members of the Intel Agilex FPGA family like Intel Agilex D series FPGA, and a new Intel Agilex family device codenamed "Sundance Mesa". These families feature specialized DSP modules optimized for tensor mathematical operations, laying the groundwork for accelerated AI computations.

In March of this year, chip giant Intel announced the establishment of a new independent FPGA company, Altera. Intel had acquired Altera for $16.7 billion in June 2015, when Altera was the world's second-largest FPGA company. Nine years later, Intel decided to operate the FPGA business independently under the Altera name once again.

An NPU (Neural Processing Unit) is another kind of ASIC, modeled loosely on the synapses of the human brain. With the rise of deep neural networks, CPUs and GPUs have gradually struggled to meet the demands of deep learning, and NPUs emerged specifically for neural-network workloads. They adopt a "data-driven parallel computing" architecture that is particularly adept at handling vast amounts of multimedia data such as video and images. Unlike CPUs and GPUs, which follow the von Neumann architecture, NPUs integrate storage and computation, resembling the structure of human neural synapses.

Arm recently announced the launch of Ethos-U85 NPU, its third-generation NPU product targeting edge AI applications. Ethos-U85 is suitable for industrial automation, video surveillance, and similar scenarios, offering a fourfold increase in performance compared to its predecessors. It also improves energy efficiency by 20% and achieves an 85% utilization rate on commonly used neural networks. Designed to work with systems based on Arm Cortex-M/A processors, it can tolerate higher memory latency.

In addition, OpenAI is exploring its own AI chip development and evaluating potential acquisition targets. AWS has its own lineup of AI chips, including the inference chip Inferentia and the training chip Trainium. Tesla, the electric vehicle manufacturer, is also actively developing AI accelerator chips, primarily for autonomous driving, and has introduced two so far: the Full Self-Driving (FSD) chip and the Dojo D1 chip.

Last May, Meta revealed details of its data center project supporting AI workloads, mentioning the development of a custom chip called MTIA to accelerate the training of generative AI models. This marked Meta's first foray into AI custom chips. Meta stated that MTIA is part of a family of chips designed to accelerate AI training and inference workloads. MTIA utilizes the open-source chip architecture RISC-V and consumes only 25 watts of power, significantly lower than mainstream products from Nvidia and others. In April of this year, Meta announced the latest version of its self-developed chip MTIA. Analysts note that Meta aims to reduce its dependence on chip manufacturers like Nvidia.

Reference: AI芯片的未来,未必是GPU (The future of AI chips may not be GPUs)
