Wafer Scale Engine-2 wafer-scale chip die map (Image: Cerebras)
This demonstration uses OpenAI's 12-billion-parameter DALL-E; because the workload does not need to be scaled across a platform of multiple accelerators, infrastructure and software complexity requirements are greatly reduced.
It should be pointed out, however, that a single CS-2 system is already comparable to a supercomputer: a single 7nm wafer (which would normally yield hundreds of mainstream chips) packs a staggering 2.6 trillion transistors, 850,000 cores, and 40GB of on-chip memory, with package power consumption of up to 15kW.
Cerebras aims to fit NLP models with up to 20 billion parameters on a single chip, significantly reducing the cost of training across thousands of GPUs and the associated hardware needed to scale, while eliminating the technical difficulty of partitioning models among them.
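To see why 20 billion parameters is roughly the point where a model fills the CS-2's 40GB of on-chip memory, a back-of-the-envelope calculation helps. The sketch below is illustrative arithmetic, not from Cerebras: `weight_memory_gb` is a hypothetical helper, and it counts only raw fp16 weights, ignoring activations, optimizer state, and the details of Cerebras's actual memory architecture.

```python
# Illustrative estimate (an assumption, not Cerebras's figures): raw weight
# storage for a model at a given parameter count, fp16 (2 bytes per parameter).
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Compare a few model sizes against the CS-2's 40GB of on-chip memory.
for name, params in [("GPT-J 6B", 6e9), ("GPT-NeoX 20B", 20e9), ("GPT-3 175B", 175e9)]:
    print(f"{name}: {weight_memory_gb(params):.0f} GB of fp16 weights")
```

At fp16, a 20B-parameter model needs about 40GB just for its weights, matching the CS-2's on-chip memory, while a 175B-parameter model like GPT-3 would need roughly 350GB and must be partitioned across many devices.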
Cerebras pointed out that this partitioning is one of the pain points of conventional NLP workloads, which can sometimes take months to complete.
Because of the high degree of customization involved, the partitioning is unique to each neural network being processed, the GPU specifications, and the network fabric that ties the GPUs together; this work must be completed before initial training and is not portable across systems.
As for OpenAI's GPT-3 natural language processing model, with a staggering 175 billion parameters, it has sometimes been able to write entire articles that you might mistake for the work of a real person.
However, Gopher, launched by DeepMind at the end of 2021, raised this number to 280 billion, and Google Brain has even announced training a Switch Transformer model with over a trillion parameters.
Cerebras CEO and co-founder Andrew Feldman noted that larger NLP models are more accurate.
But very few companies have the necessary resources and expertise to break these large models down and spread them across hundreds, or thousands, of GPUs.
As a result, only a handful of companies have been able to train large NLP models; for the rest of the industry they remain too expensive, too time-consuming, and too difficult to use.
Today, Cerebras is proud to announce the general availability of GPT-3 XL 1.3B, GPT-J 6B, GPT-3 13B, and GPT-NeoX 20B, enabling the entire AI ecosystem to set up large models in minutes and start training them on the platform.
However, just as clock frequency is only one indicator of CPU performance, parameter count is only one indicator of model capability. For example, Chinchilla achieves better results than GPT-3 and Gopher while using far fewer parameters (70 billion).
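Chinchilla's result is often summarized as a rule of thumb: a compute-optimal model should see roughly 20 training tokens per parameter. The sketch below applies that approximation; `compute_optimal_tokens` is an illustrative helper name, and the 20x factor is a rough figure from the DeepMind paper, not an exact constant.

```python
# Rough Chinchilla rule of thumb (an approximation, assumed here):
# compute-optimal training uses ~20 tokens per model parameter.
def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token budget for a model."""
    return n_params * tokens_per_param

# Chinchilla (70B params) was trained on about 1.4 trillion tokens,
# versus roughly 300 billion tokens for the much larger GPT-3 (175B).
print(f"Chinchilla 70B budget: {compute_optimal_tokens(70e9):.2e} tokens")
```

Under this rule, GPT-3's roughly 300 billion training tokens fall far short of the ~3.5 trillion its 175 billion parameters would call for, which is why a smaller model trained on more data can come out ahead.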