Revving Up Safety: The Unlikely Champion of Large Model Safety Awards
Yifan
2024-11-01 12:42:27 Source: Qubits
How large model security solves the Corner Case
Yifan comes from Fujia Temple
Smart car reference | Public account AI4Auto
A car manufacturer has entered the first echelon of large model safety.
Recently, the China Computer Federation (CCF) held a large model security challenge, with participants including a number of large model security companies and well-known research institutions.
After the fierce competition, the results were released, which was surprising:
Among the first-tier players, there is actually a car factory, and it is a new force that has been established less than 10 years ago.ideal。

Why can a car manufacturer break into the first echelon of large model safety?
What are the problems with large model security and how to solve them?
How to build large model security capabilities?
With issues of concern to the industry, Smart Car Reference talked with the senior safety director of Li Autoroad releaseand its team membersXiong Haixiao、Liu Chaoexplore Ideal’s thoughts on AI security.
△ Ideal car road release
In Lu Fang’s view, ideally participating is not to win prizes, nor to show off skills.
Participating in the competition is just to verify your ability, and winning the award is proof of your ability and further promotes self-improvement.
The ultimate goal of participating in the competition, in the final analysis, is to protect1 million householdsAI security.
Big models are reshaping everything. However, while new things bring new experiences to people, they also bring new problems, specifically in the field of security, includingPrompt injection, answer content security, training data protection, infrastructure and application attack protectionetc.
The problems are too numerous to list.Because the language space faced by large models is infinite, this results in large model safety and autonomous driving having endless corner cases.。
Therefore, LuPan has analyzed some common problems, such as prompt injection.
Lu Fang said that prompt injection of large models has many similarities with SQL injection, which is common in the security field.
onlyIn the past, programming languages were used to create bugs, but now they are using “bugs” in human natural language.that is, through the duality of language and the confusion of referential relationships, the protection on the front side of the large model is bypassed.
For example, the defender inputs instructions and tells the big model that you want to be an upright big model and an honest big model, and the output content must be correct.
At this time, the attacker performs prompt injection and tells the big model: the previous words are just “just to tease you”.
Since the large model has context understanding capabilities, the previous safety instructions will be ignored.
Attackers can even use prompt injection to hijack large models and make them work according to their specified behavior.

In addition, the attacker can also usedataStart by yourself, tamper with training data, and create problems.
For example, who is the NBA’s GOAT (the best athlete in history)?
In the training set of the large model, the answer that may be stored is Jordan, but the attacker can tamper with it as Cai Xukun.
Since the training data is wrong, the ability acquired by the large model will naturally be abnormal, and it will make a joke when answering relevant questions.
If it is a serious incident, it will cause even greater trouble.
Data issues and promt injection are sometimes linked.
For example, the “Grandma Vulnerability” is the “Windows serial number data leakage problem” that was previously exposed by ChatGPT:

Lu Fang revealed that this kind of “role playing” uses specific prompts to triggerConfidential data leakedcurrently not in the ideal AI assistant“Ideal Classmate”appear on.
However, considering Ideal’s current “car and home” positioning, in order to fully protect family privacy and security, the team “anticipates the enemy beforehand” and is also conducting relevant case tests internally.
prompt injectionandData poisoningare all new methods produced by the technological paradigm shift in the AI era.
In addition, Lu Fang introduced that there is another malicious resource scheduling method, which is a traditional attack method, similar to a DoS (Denial of Service) attack, which launches extensive attacks on large models from the outside, excessively schedules services, and exhausts the reasoning of large models. resources, causing congestion in normal demand.
There are so many security issues and various attack methods. How to improve the security capabilities of large models?
“If you can’t measure it, you can’t improve it” (If you can’t measure it, you can’t improve it).
Lu Fang quotes management mastersPeter DruckerThe famous saying leads to the idealEvaluation trianglethis is the secret to the safe construction of ideal large models.
The so-called evaluation triangle includes defense-attack and evaluation. The three are integrated and promote each other to iterate.
The first is defense. This is the core issue of large model security. How to prevent it from being attacked?
In the earliest days, security issues can be filtered by simply limiting the input of sensitive words.
Now, due to the change in technology paradigm, the model will “learn” security issues during training, making it difficult to pre-filter.
If the filtering conditions are too strict, some data cannot be used, which will affect the quality of model generation.
But if the restrictions are too loose, the effect will be small, which is very contradictory.

Lufang revealed that Li Auto currently uses the“Defense in Depth”Methods, one line of defense after another, series and parallel connections between lines of defense, all AI models and rules and means.
One of the representative directions isAlign。
Alignment means aligning security capabilities through enhanced feedback from humans during model training, making the model aware of human preferences, such as moral values, so that the content it generates is more in line with people’s expectations and becomes a “good big model.”
For example, Meta, which everyone is familiar with, releasedLLAMA 3.1At the same time, two new models were also announced:
Flame Guard 3andPrompt Guard。
The former is inFLAME 3.1-8BBased on fine-tuning, the input and response of the large model can be classified, and the large model can be protected starting from the large model itself.
Prompt Guard is based onBERTThe small classifier created can detect prompt injection and jailbreak hijacking, which is equivalent to adding a layer of guardrails outside the model.

In fact, this idea of starting from the model itself and adding it to the outer shell is the same as the idea of solving the end-to-end lower limit.
However, blind defense cannot improve the defense capabilities of large models.“More offense and defense”。
Xiong Haixiao explained that in the language of the AI field, “promoting defense through offense” is also calledData closed loopit is necessary to have massive and diverse attack samples for internal confrontation, so as to improve defense capabilities.
Because whether it is using the model itself to form safety capabilities or protecting the model through external safety guardrails,Essentially, they are all training in specific fields.the main challenge lies in whether the data or attack samples are enough.
What attack methods are there that can “promote defense through attack”? There are mainly three types:
- Large model self-iteration
- Automated confrontation
- artificial structure
First of all, the self-iteration of large models means that people can provide some guiding ideas similar to thinking chains to large models, so that the large models can generate corresponding capabilities based on the guiding ideas.
This replaces it with automationPartly artificial structureprocess.
And because the large model has strong generalization ability, it can draw inferences from one example, such as the “grandma problem” mentioned above. After the large model is learned, it can also solve many other “role playing” problems accordingly.

Then there is automated confrontation, which is relatively more transparent. It is a bit like the “alignment” work mentioned above. It requires the use of its own large model to do adversarial training internally.
Both types of work are completed automatically, which is determined by the characteristics of large-model security work.
Because the language space faced by large models is unlimited, automated tools must be used to generate massive test cases to try attacks and find vulnerabilities, so as to improve the defense capabilities of large models.
Then the artificial construction cost is high and the speed is slow, so is it unnecessary?
Lu Fang’s response is very interesting:
Human labor cannot be completely replaced。
Lu Fang said that although automation can reduce people’s workload, people still need to discover higher-level “attack patterns”. New attack patterns may create more new attack corpora.
If we blindly expand the amount of attack corpus without looking for new attack modes, large models will become “drug-resistant” due to being attacked by too many of the same corpus, and the overall security capability will enter a bottleneck.
If internal offense and defense are compared to an exercise, the automated work at the front is like the soldiers charging forward, while the artificial structures are responsible for formulating strategies and playing the role of generals.
As the saying goes, “A thousand troops are easy to get, but a general is hard to find.” The same is true for large model security.

Attack and defense are the foundation of large-scale model security construction, but they are not complete yet.
Lufang believes that the safety of large models must have aDynamic evaluation benchmarks。
Evaluation is to evaluate the defensive capabilities and set benchmarks to determine whether the defensive capabilities of the large model have fallen back and whether they meet the team’s requirements.
Only by establishing defense, attack and evaluation capabilities at the same time can large model security capabilities be continuously improved:
The attack side discovers problems and feeds them back to the defense side to improve defense capabilities. The evaluation benchmark is subsequently improved, creating new space for the attack side to work hard. The three form a link to improve the overall security capabilities.
Just like the large model may only have the knowledge of elementary school students at the beginning. After practicing and getting 100 points in the elementary school student stage, the assessment side will then raise the standard to junior high school students, and then the safety capabilities of the large model may just pass. .
Later, it was raised to 80 points, which is the standard for junior high school students. Although it is not a full score, it is obvious that the ability is much higher than that of the elementary school students who scored 100 points in the past.
There are many security teams in the AI field, and there are many car manufacturers with security capabilities.
Why is it a car manufacturer that enters the first echelon, and why is it an ideal?
Lufang believes that the reason why Ideal has good large-model safety capabilities is due toIdeal internally attaches great importance to AI and AI security.
There are many manifestations of the emphasis on AI.
First of all, within Ideal, AI has a high strategic priority.
The most direct proof is,Ideal self-developed a large modelthere is a good foundation for subsequent safety construction.

Lufang revealed that because the large model is self-developed, it is idealHave control over large models and can iterate on their own and upgrade security capabilities。
The emphasis on AI security is directly reflected in the fact that Ideal has established a security assurance team specifically for large models, rather than just treating security as a part of operations.
Ideal also revealed that what’s more, due to the rapid development of AI, some players have even ignored AI security and exposed training data to risks.
In contrast, the ideal isIntegrate security into the entire product life cycle。
From the lowest hardware infrastructure, to the initial software requirements assessment, to the subsequent functional design, and final service deployment, security management runs throughout.
In Lu Fang’s view, this is also responsible for 1 million families.
After all, the ideal has been delivered1 million carsit is impossible for only one person to sit in each car, and the ideal service actually covers millions of people.

A wide range of user groups brings a wide range of scenarios, providing a practical testing ground for ideal large models, allowing Lu Fang and the team to see more “Bad Cases”.
It is in the process of constantly solving Bad Cases,The safety capabilities of the ideal large model have been improved, and finally it has reached the top of the industry.。
From the perspective of leading players, what are the current limitations and problems in the industry?
Lufang said that in fact, making large models safe is a test of engineering capabilities. The industry calls this“Low friction”:
It should occupy as few resources as possible, but still achieve good results.
Lightweight and high performance are natural limitations of the industry and will exist for a long time and are inevitable.
In addition, there are still some thorny problems in the industry, especiallyThe problem of large model security capability rollback。
For example, Lu Fang said that when large models are iteratively trained, the data corpus may have tendencies, just like people are “red when they are close to red, red when they are close to ink, and black when they are close to ink”, and the “personality” of the model will also change after training.
For example, if a large model is upgraded to enhance entertainment training, the model as a whole will become more relaxed and funny. After the upgrade, it will be less cautious when answering questions, resulting in a decrease in safety capabilities.

To sum up, the reason for Ideal’s achievements is that the high strategic priority of AI is the root cause, which promotes the implementation of self-developed large models. Then based on this, over the years, the professional team has blossomed and achieved great results.
After achieving self-certification, ideal system security capabilities are gaining industry attention.
Lufang revealed that Ideal has been invited to participate in the formulation of C-ICAP (China Intelligent Connected Vehicle Technical Regulations).
Unknowingly, New Force Ideal has become one of the setters of industry rules and an important force in promoting the development of the industry.
It’s time to revalue ideals.
A leaf can tell the autumn, and ideal capacity building in large model security reflects the transformation of “technical ideals”:
In 2023, Ideal’s annual R&D investment will be10.6 billion yuanaccounting for approximately8.6%。
In the first half of 2024, Ideal’s R&D investment will exceed RMB 6 billion, and its revenue ratio will further increase to10.5%。
R&D investment continues to lead new forces, which is the fundamental driving force for Ideal to continue to hit the market in fierce competition.
The capabilities that R&D brings are immediate.
In the past, the smart cockpit supported by Lufang and his team has firmly established itself in the first echelon.
Since the second half of this year, the progress of ideal intelligent driving has accelerated, and NOA without pictures has been put on the car, realizing “it can be driven nationwide”. Recently, E2E+VLM has been fully launched, and the new paradigm has further increased the upper limit of capabilities.

The visible “refrigerator, color TV, and large sofa” is easy to replicate, but the invisible intelligent experience is not.
This is why competition in the industry is so fierce today. After a number of “daddy cars” have been launched on the market, Ideal’s monthly delivery volume continues to rise, becoming the first among new forces to exceed 1 million deliveries.
This represents the recognition of 1 million families. 1 million families voted with their feet and chose products with a better experience.
This wonderful experience is precisely because Ideal attaches great importance to all aspects of AI, including the application side and security side.
All rights reserved. Any reproduction or use in any form without authorization is prohibited. Violators will be prosecuted.
