Ministry of State Security Calls Data Contamination a Fundamental Challenge to AI Security, Citing National Security and Socioeconomic Risks
In an article published on its official WeChat account, the ministry said AI data sources are sometimes polluted by mixed-quality content containing false information, fabricated narratives and biased viewpoints. As AI is increasingly integrated into China’s socioeconomic sectors, such contamination poses risks to high-quality development and national security, it said.
Data serves as the essential foundation for AI systems, providing the raw material for models to learn patterns, make decisions and generate content, the ministry said. It warned that compromised data quality directly undermines model reliability. Citing research, it noted that even a small contamination level, such as 0.01 percent of false text, can increase harmful outputs by 11.2 percent.
The ministry also highlighted the danger of “recursive pollution”, in which false content generated by AI becomes part of the training datasets for future models, leading to compounding errors. Real-world risks include financial market manipulation through fabricated information, public panic triggered by misinformation and life-threatening medical misjudgments from corrupted diagnostic algorithms, it said.
To counter these threats, the ministry proposed stricter source supervision under existing cybersecurity and data security laws, comprehensive risk assessments and systematic data-cleansing frameworks. It pledged to collaborate with relevant agencies to safeguard AI and data security under China’s national security framework.
Zhang Xi, deputy dean and professor at the School of Cyberspace Security at the Beijing University of Posts and Telecommunications, said China is particularly vulnerable because of a shortage of high-quality Chinese-language training data. Chinese data makes up just 1.3 percent of global large-model datasets, he said.
This scarcity, along with copyright restrictions and inadequate data infrastructure, has forced domestic developers to rely on lower-quality sources such as machine-translated or synthetic content, which worsens data pollution and hinders progress in Chinese AI development, he said.
Zhang cited the GPT-3 model, which was trained on 750 gigabytes of data, and China’s DeepSeek-V3 model, trained on 14.8 trillion high-quality text fragments. These datasets are drawn from vast libraries of books, academic papers, online texts and code. But because of their scale, manual inspection is neither feasible nor cost-effective, making data contamination an increasingly serious bottleneck, he said.
Polluted coaching knowledge additionally creates unpredictable dangers in high-stakes fields comparable to drugs, autonomous driving and nationwide protection, Zhang mentioned. He cited a examine during which the insertion of 5,000 fabricated medical data raised misdiagnosis charges by 73 %. In one other instance, inserting three manipulated picture frames induced autonomous autos to mistake pedestrians for rubbish luggage, resulting in 92 % collision charges in testing.
Zhang also warned of malicious data poisoning campaigns, in which adversarial actors inject content contrary to China’s core socialist values. As an example, he pointed to foreign-developed models that generated separatist content related to the Xizang autonomous region.
To protect data sovereignty, Zhang advocated greater investment in domestic data collection and the establishment of national public data platforms. He also called for legal mechanisms to criminalize malicious data poisoning and assign liability for data contamination caused by negligence, with responsibilities clarified for developers, data suppliers and operators.
Shen Yang, a professor at Tsinghua University’s School of Journalism and Communication and College of AI, defined AI data pollution as the inclusion of inaccurate, incomplete, biased or deliberately manipulated content in training data.
This fundamentally weakens AI models’ comprehension, judgment and output reliability, he said.
Shen compared polluted training data to “cooking with spoiled ingredients”.
He said malicious actors may seek to manipulate AI on sensitive topics, mislead the public, undermine competitors or probe vulnerabilities in AI systems. While such acts are usually isolated rather than coordinated conspiracies, their cumulative impact can erode public trust in AI, he said.
For the general public, Shen said it is essential to understand that AI-generated content can shape, or distort, their perception of reality. “They need to see through the logic behind AI, just like identifying the motives behind people’s words,” he said.