Learning Machines #1 - big models work well

by Laurent Cetinsoy, published on 06/05/2024

Big ML models do perform well, and that makes some people sad

There is a trend in machine learning that some people, especially in small labs, do not like much: the growing size of machine learning models and their computing requirements. Indeed, you hear more and more research articles saying "We trained a gigantic, billion-parameter model on some enormous dataset for a very long time on a very large number of GPUs and reached state-of-the-art performance".

Doing that is, obviously, pretty expensive: experiments costing as much as 1 million USD have been reported, leaving researchers with little funding frustrated and unable to compete with these lines of research. Indeed, unlike OpenAI, not everyone receives billions in cloud credits from Microsoft, and all these parameters do not fit in a single GPU. These approaches are nonetheless pursued because they, well... seem to work well!

Besides, operating such big (and even small) models comes with engineering challenges and significant costs. It has been reported that AI products and AI-based companies have less favorable economic fundamentals than standard web application companies.

 

This trend started with deep learning and its layered models, which routinely have thousands, even millions, of parameters, enough to drive any normal person working in classical statistics crazy. People noticed that these models were not over-fitting as much as they should. At first, some implicit regularization mechanism was suspected; the currently favored explanations are the lottery ticket hypothesis and the double descent curve. Training on the whole internet can also be seen as a regularization scheme: there is no generalization problem if the universe is your dataset.
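
To give a feel for the double descent curve, here is a small, hedged sketch of a classic style of experiment (the dataset, the random Fourier feature map and the sizes are arbitrary illustrative choices, not any particular paper's setup): fit a minimum-norm linear model on features of growing dimension and watch the test error, which typically rises and peaks near the interpolation threshold (where the number of features matches the number of training points) before decreasing again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: a noisy sine wave
n_train, n_test = 50, 200
x_train = rng.uniform(-1, 1, n_train)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = np.sin(3 * x_test)

def features(x, w, b):
    """Random Fourier features: a crude stand-in for a model of growing size."""
    return np.cos(np.outer(x, w) + b)

for n_features in [5, 20, 45, 50, 55, 100, 500]:
    w = rng.normal(size=n_features) * 3.0
    b = rng.uniform(0, 2 * np.pi, n_features)
    Phi_train = features(x_train, w, b)
    Phi_test = features(x_test, w, b)
    # Minimum-norm least squares: the model interpolates once n_features >= n_train
    beta = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"{n_features:4d} features -> test MSE {test_mse:.3f}")
```

The lottery ticket hypothesis is a different observation: inside a large trained network there often exists a small sub-network that, retrained on its own from its original initialization, matches the performance of the full model.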

People did not stop with such cheap models: some companies, like Google, started crazily compute-intensive AutoML experiments, trying to find deep architectures automatically with hundreds of GPUs. AlphaGo, which beat the Go world champion, was trained with reinforcement learning on zillions of games.

Models recently grew in size with the emergence of attention modules. Such modules let the model learn to look at the most relevant parts of a sentence while doing its work. They enabled the training of much bigger NLP models for translation and text generation. GPT-2, developed by OpenAI, became the most famous billion-parameter model and impressed with its performance. More recently, GPT-3 and GPT-Image showed even further improvements.
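
For readers who like to see the mechanics, here is a minimal NumPy sketch of scaled dot-product attention, the core operation behind these modules; the shapes, the toy data and the function name are illustrative choices, not the exact formulation used in any particular model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query scores every key, the scores are softmaxed, and the values are averaged accordingly."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # how much each token "looks at" each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of the values

# Toy self-attention: 4 "tokens" with embedding dimension 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)   # (4, 8): one updated representation per token
```

In real transformer models, Q, K and V come from learned linear projections of the token embeddings, and many such attention heads run in parallel, which is a large part of where the parameter count comes from.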

Such big models may not disappear overnight, as these bigger models seem to have the nice property, besides being costly to train, of generalizing better than smaller models. Will these bigger models be the only ones to show higher generalization (and generation) capabilities, or will we be able to have small models too?

To put that in context, keep in mind that human brains are actually pretty big in proportion to their bodies compared to other mammals. They are, however, contrary to GPU-based models, much more energy efficient. My brain, writing these very fine lines, is consuming roughly 40 watts.

So researchers unable to throw thousands of dollars at training models may want to find much more energy-efficient methods. And that could require some innovation in hardware. Indeed, many companies are trying to develop co-processors to make AI much more efficient. The basic idea being: CPUs and even GPUs are too generic; let's build a processor with specific AI operations so that AI runs much faster on that device than on a CPU or GPU. This is, for example, the idea behind Google's TPU. However, beating Nvidia seems pretty hard, as Google still buys Nvidia GPUs for its cloud services.

A recent article showed that an approach often succeeds not because it is intrinsically better but because it is supported by the available hardware: see the hardware lottery.

One could remember the words of the pioneer Alan Kay: "researchers should design their own hardware". This approach could free scientists from being constrained by the hardware they have. Though noble, this philosophy requires much more work, as designing hardware is time consuming and many machine learning practitioners do not have an electronics background. However, there is hope! If you have heard about FPGAs, you may think: let's do that! And there are indeed very interesting things happening in this space: tools, which used to be pretty awful, are becoming much more usable thanks to a growing community of open source tooling.

 

Another hope comes from improvements in the models themselves and from techniques that reduce their computational complexity, like quantization, pruning and distillation (sketched below). Models tend to become more efficient over time, at a faster pace than Moore's law.
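
As a rough illustration of what two of these techniques do, here is a small NumPy sketch, assuming nothing more than a toy weight matrix: magnitude pruning zeroes out the smallest weights, and 8-bit quantization stores weights as integers plus a scale factor. This is only a sketch of the idea, not a production recipe (real toolchains also handle activations, calibration and fine-tuning).

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights, keeping the top (1 - sparsity) fraction."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_int8(weights):
    """Map float weights to int8 plus a scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)).astype(np.float32)

W_sparse = magnitude_prune(W, sparsity=0.5)        # half of the weights become zero
W_q, scale = quantize_int8(W)                      # 4x smaller storage than float32
W_approx = W_q.astype(np.float32) * scale          # dequantized approximation
print(np.abs(W - W_approx).max())                  # small quantization error
```

Distillation, the third technique mentioned above, instead trains a small "student" model to imitate the outputs of a large "teacher" model.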

Share the article!