Newsletter

Hyena can match GPT-4’s accuracy while using 100 times less computing power

The new technology, called Hyena, can achieve the same accuracy as GPT-4 while using 100 times less computing power.

Despite the global buzz around OpenAI’s chatbot ChatGPT and its latest AI language model, GPT-4, these language models are, at the end of the day, just software applications. Like all software, they have technical limitations.

In March this year, artificial intelligence researchers at Stanford University and Canada’s MILA institute for AI published a joint paper proposing a new technology called Hyena. The technology is even more efficient than GPT-4 and similar AI systems: it can take in vast amounts of data and turn it into the answer the user wants.

Using only a fraction of the computing power, Hyena achieved accuracy on par with GPT-4 on benchmarks such as question answering. And in some cases Hyena can handle far larger amounts of text, whereas GPT-4 can process no more than about 25,000 words at a time.

In 2017, Google scientist Ashish Vaswani and his colleagues published “Attention Is All You Need”, a milestone paper in artificial intelligence research. It gave a detailed introduction to the Transformer, a neural network architecture: by stacking Transformer blocks, one can build a trainable network that is good at language-comprehension tasks while demanding relatively little computing power. The Transformer has enormous potential and has become the basis of many major language models, such as ChatGPT. Answering that title, the Hyena authors argue that their promising results at the sub-billion-parameter scale suggest we may not need all that attention after all.

However, the Transformer has a major flaw. To process large amounts of input, it borrows an idea from the human brain, the “attention mechanism”: rather than weighing every piece of input equally, it selects only the key pieces to focus on, which is meant to make the network more efficient.

Unfortunately, the attention mechanism has “quadratic computational complexity”: its time and memory costs grow with the square of the sequence length, so it handles long text sequences poorly. This inherent flaw is shared by every major language program built on the Transformer, including ChatGPT and GPT-4. The quadratic complexity means that the time ChatGPT takes to generate an answer climbs steeply as the amount of input data grows.
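To make that scaling concrete, here is a minimal Python sketch (an illustration only, not OpenAI’s implementation; the toy naive_attention function and the sizes are assumptions) of why the cost grows quadratically: each of the n tokens must be scored against every other token, producing an n-by-n matrix of attention scores.

```python
import numpy as np

def naive_attention(q, k, v):
    """Single-head scaled dot-product attention over a length-n sequence."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (n, n) matrix: the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (n, d) output

for n in (1_000, 2_000, 4_000):
    q = k = v = np.random.randn(n, 64)
    naive_attention(q, k, v)
    # Doubling the number of tokens quadruples the score matrix.
    print(f"{n} tokens -> {n * n:,} attention scores")
```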

In practice, if the prompt is too long, either the program cannot produce an answer at all, or it must be given enough computing power to keep up, which is driving a surge in the computing requirements of AI chatbots.

In the new paper, “Hyena Hierarchy: Towards Larger Convolutional Language Models”, lead author Michael Poli of Stanford University and his colleagues propose replacing the Transformer’s attention function with a subquadratic alternative, namely Hyena.

The authors do not explain where the name “Hyena” comes from, but one can imagine a few reasons. Hyenas are animals that live in Africa and will range for miles when hunting; in a way, a very powerful language model acts like a hyena, combing through tens of thousands of words of text in pursuit of the “answer” it is after.

But as the title suggests, what the authors really care about is “hierarchy”. Hyena clans have a strict pecking order: the queen ranks highest, followed by the cubs, with the males at the bottom, and it is this hierarchy that secures the queen’s leadership and dominance over the whole group. In a loosely similar way, the Hyena program repeatedly applies a series of very simple operations and combines them into a hierarchy of data processing, which is presumably how the program got its name.
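As a rough sketch of what “repeatedly applying a series of very simple operations” can look like, the toy operator below alternates elementwise gating with long convolutions, in the spirit of the Hyena recurrence; the function names, sizes and random stand-in values are illustrative assumptions, not the authors’ code.

```python
import numpy as np

def long_conv(x, h):
    """Causal convolution of a length-n signal with a length-n filter via FFT,
    costing O(n log n) rather than the O(n^2) of dense attention."""
    n = x.shape[0]
    X = np.fft.rfft(x, 2 * n)
    H = np.fft.rfft(h, 2 * n)
    return np.fft.irfft(X * H, 2 * n)[:n]

def hyena_like_operator(v, gates, filters):
    """Hierarchy of simple steps: gate elementwise, then convolve, repeated."""
    y = v
    for g, h in zip(gates, filters):
        y = long_conv(g * y, h)
    return y

n, order = 4_096, 2
v = np.random.randn(n)
gates = [np.random.randn(n) for _ in range(order)]            # stand-ins for learned projections
filters = [np.random.randn(n) * 0.01 for _ in range(order)]   # stand-ins for learned long filters
print(hyena_like_operator(v, gates, filters).shape)           # (4096,)
```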

The paper’s co-authors include some of the most prominent figures in artificial intelligence, such as Yoshua Bengio, scientific director of Canada’s MILA institute and winner of the 2019 Turing Award (the computing field’s equivalent of a Nobel Prize). Bengio is credited with developing attention mechanisms long before Vaswani and his team applied them to the Transformer. Another co-author, Christopher Ré, an associate professor of computer science at Stanford University, has in recent years helped advance the concept of artificial intelligence as “software 2.0”.

To find an alternative to the attention mechanism’s “quadratic computational complexity”, Poli and his team set out to study how the mechanism actually works.

A recent and highly practical area of AI research, known as mechanistic interpretability, is yielding insight into what goes on inside neural networks, including how the attention mechanism works. You can think of it as taking a computer apart, examining its individual parts, and figuring out how it operates.

Poli and his team cite a series of experiments by Nelson Elhage, a researcher at the AI startup Anthropic, who analyzed the Transformer’s algorithmic structure from the ground up, explaining at a fundamental level what the model does as it processes and produces text and digging into how the attention mechanism behind it actually works.

At the most basic level, Elhage and his team found, attention works through very simple computer operations. Suppose the input is “Professor Judy is so busy… because Professor X…”, where X should resolve to “Judy”. The attention mechanism looks at the most recent word in the context, “Professor”, searches the earlier context for a specific word associated with it, and then emits that associated word as the model’s output.

As another example, if someone enters into ChatGPT a sentence from “Harry Potter and the Sorcerer’s Stone”, such as “Mr. Dursley was the director of a firm called Grunnings…”, then typing “Durs”, the start of the name, may be enough to prompt the program to complete the name “Dursley”, because it has already seen that name earlier in the passage. The system can copy the characters “ley” from memory to automatically complete the sentence in its output.
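A deliberately simplified toy version of that copy-from-memory behaviour (an illustration of the idea only, not Elhage’s analysis code, and the whitespace tokenization is an assumption) could be written like this.

```python
def induction_copy(tokens):
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current token."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

context = "Mr Durs ley was the director of a firm called Grunnings ... Mr Durs".split()
print(induction_copy(context))   # -> 'ley', completing "Dursley" from the earlier mention
```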

However, as the number of words increases, the attention mechanism runs into its quadratic complexity: more text requires more “weights”, or parameters, to run.

As the authors write: “The Transformer block is a powerful tool for sequence modeling, but it is not without limitations. The most notable of these is the computational cost, which grows rapidly as the length of the input sequence increases.”

Although OpenAI has not disclosed the technical details of ChatGPT and GPT-4, it is understood that they may have a trillion or more such parameters. Running these parameters requires more GPU chips, thereby increasing the computational cost.

To escape the quadratic computational cost, Poli and the team replaced the attention mechanism with a so-called “convolution”, one of the oldest operations in AI programming, refined as far back as the 1980s. A convolution is essentially a filter that can pick out elements in data, whether the pixels of an image or the words of a text.

Poli and his team then performed a kind of hybridization, combining work by Stanford researcher Daniel Y. Fu and his team with research by David Romero and colleagues at the Vrije Universiteit Amsterdam, which lets the program change the size of its filter on the fly. This ability to adapt flexibly reduces the number of parameters, or weights, the program requires.

The resulting convolution can be applied to an unlimited amount of text without needing more and more parameters to keep the program running. It is, as the authors put it, an “attention-free” approach.
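One way to picture how a filter can grow with the input without adding parameters, loosely in the spirit of the implicit-filter work by Fu and Romero cited above (the basis functions, decay term and sizes below are illustrative assumptions), is a tiny function that generates a filter value for every position.

```python
import numpy as np

rng = np.random.default_rng(0)
freqs = rng.normal(size=8)      # the only "learned" parameters in this sketch
coeffs = rng.normal(size=8)

def implicit_filter(n):
    """Generate a length-n convolution filter from a fixed, tiny parameter set."""
    t = np.linspace(0.0, 1.0, n)[:, None]             # positions rescaled to [0, 1]
    basis = np.sin(2 * np.pi * t * freqs)             # (n, 8) positional features
    return (basis @ coeffs) * np.exp(-4.0 * t[:, 0])  # decaying length-n filter

for n in (1_024, 65_536):
    h = implicit_filter(n)
    print(f"sequence length {n}: filter length {len(h)}, parameters still {freqs.size + coeffs.size}")
```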

“Hyena is able to significantly narrow the gap with attention mechanisms, reaching comparable quality with a smaller compute budget,” Poli and his team write.

In order to demonstrate Hyena’s capabilities, the authors tested the program against a series of benchmarks that determine how well a language program performs on various artificial intelligence tasks.

One such test uses The Pile, an 825 GiB open-source language-modeling dataset assembled in 2020 by the non-profit AI research group EleutherAI. Its texts are drawn from 22 smaller, high-quality datasets, including PubMed, arXiv, GitHub and USPTO, sources chosen because they are more specialized than ordinary web text.

The main challenge for the program is to produce the next word when fed a batch of new sentences. Hyena was able to match the accuracy of OpenAI’s original 2018 GPT program with 20 percent fewer computational operations, the researchers write, making it the first attention-free convolution model to match GPT quality.

Next, the authors tested the program on a set of reasoning tasks known as SuperGLUE, introduced in 2019 by academics at New York University, Facebook AI Research, Google’s DeepMind unit, and the University of Washington.

For example, given the premise “my body cast a shadow over the grass” and two candidate causes for the phenomenon, “the sun was rising” or “the grass was cut”, the program is asked to choose the more plausible one, and should output “the sun was rising”.
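For reference, this style of item comes from SuperGLUE’s COPA task and is commonly represented in roughly the following form; the field names follow the public COPA format, but treat the exact schema and prompt wording as assumptions.

```python
copa_item = {
    "premise": "My body cast a shadow over the grass.",
    "choice1": "The sun was rising.",
    "choice2": "The grass was cut.",
    "question": "cause",   # the model must pick the more plausible cause
    "label": 0,            # choice1 ("The sun was rising.") is correct
}

prompt = (f'{copa_item["premise"]} What was the cause? '
          f'(A) {copa_item["choice1"]} (B) {copa_item["choice2"]}')
print(prompt)
```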

Across these tasks, the Hyena model scored at or close to a version of GPT while being trained on less than half as much training data. Even more interesting is what happened when the authors increased the length of the input text: with more characters, performance improved and the relative time required fell.

Poli and the team believe that with Hyena they have not merely tried a different approach but broken through the quadratic barrier, bringing a qualitative change in how hard it is for a program to compute its results.

Down the road, they believe, breaking the quadratic barrier is a key step toward new possibilities for deep learning, such as using entire textbooks as context, composing long pieces of music, or processing gigapixel-scale images.

The authors write that Hyena’s filter can scale efficiently to tens of thousands of words, meaning there is effectively no limit on the context a query to the language program can include; the model could even recall material from earlier text or from previous conversations.

They suggest that Hyena’s filter is not artificially constrained and can attend to any element of the “input prompt”. Moreover, beyond text, the program can also be applied to other kinds of data, such as images, and perhaps video and audio.

It is worth noting that the Hyena program shown in the paper is small compared to GPT-4 or even GPT-3. GPT-3 has 175 billion parameters or weights, while Hyena has a maximum of 1.3 billion parameters. So, it remains to be seen how Hyena performs when compared comprehensively to GPT-3 or GPT-4.

But if the Hyena approach also proves efficient at larger scales, it could become very popular, much as attention mechanisms have been over the past decade.

As Poli and his team conclude: “Simpler subquadratic designs such as Hyena, informed by a set of simple guiding principles and evaluation on mechanistic interpretability benchmarks, may form the basis for efficient large models.”