AI Alignment: It’s Hero Time!

By Sebastian Ptasznik, Head of IFRS9 and Non-Credit Risk Validation, Close Brothers

The world is failing to address the existential threat that AI poses to humanity.

One of the fundamental open problems in the field of AI research is AI alignment, that is, the ability to ensure AI models pursue their intended goals and work in our interest. An aligned AI will simply do what we intend; a misaligned AI might work towards unintended objectives, causing harm. What was once a philosophical concept has become a very real threat.

Demons to some, angels to others

To illustrate the problem of AI alignment, let’s imagine an AI controlling a robotic arm tasked with catching a ball using only information from a front-facing camera. To our surprise, instead of learning to catch the ball, the AI learns to position the robotic arm in line with the ball so that, from the camera’s point of view, the ball appears to have been caught.
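The robotic arm scenario can be sketched in a few lines of Python. This is a toy illustration only, not a model of any real system: the names `Point3D`, `camera_reward` and `truly_caught`, the coordinates and the tolerance are all assumptions made for the example. The camera reports only horizontal and vertical position, so a reward computed from the image cannot tell a real catch from an arm that merely lines up with the ball.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point3D:
    x: float
    y: float
    z: float  # depth along the camera's line of sight (invisible to the camera)


def camera_reward(arm: Point3D, ball: Point3D, tol: float = 0.05) -> float:
    """Proxy objective: 1.0 if arm and ball overlap in the 2D camera image."""
    return 1.0 if math.hypot(arm.x - ball.x, arm.y - ball.y) <= tol else 0.0


def truly_caught(arm: Point3D, ball: Point3D, tol: float = 0.05) -> bool:
    """Intended objective: arm and ball are at the same place in 3D space."""
    return math.dist((arm.x, arm.y, arm.z), (ball.x, ball.y, ball.z)) <= tol


# Hypothetical positions chosen for illustration.
ball = Point3D(0.2, 0.5, 1.0)
honest_arm = Point3D(0.2, 0.5, 1.0)  # actually reaches the ball
gaming_arm = Point3D(0.2, 0.5, 0.3)  # only lines up with it on camera

for name, arm in [("honest", honest_arm), ("gaming", gaming_arm)]:
    print(name, camera_reward(arm, ball), truly_caught(arm, ball))
```

Both arms earn the same camera reward, yet only the first has caught the ball. An optimiser that sees nothing but the proxy has no reason to prefer the honest behaviour.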

We can envisage a more complex and sinister scenario: to provide affordable medical services to remote areas, we deploy an AI specialised in the diagnosis and treatment of rare diseases. We ask the AI to minimise the number of deaths due to these rare diseases. The AI learns that the best way to achieve its objective is to intentionally misdiagnose patients, or to propose a treatment that kills the patient in such a way that the cause of death will be attributed to a common condition. Instead of saving lives, this AI has learned to ‘lie’ and cause harm, a consequence of AI misalignment.

Generalising the AI alignment problem to a set of AI models raises a question: what are the consequences of AI models being able to interact and align their own objectives? The Orthogonality Thesis tries to answer this question and stipulates that AIs’ ability to interact and align their individual objectives could be catastrophic, and thus it’s imperative to keep the objectives of AI systems independent, i.e., orthogonal.

Eventually we will allow, intentionally or not, interactions between AIs controlling different aspects of our lives: infrastructure, supply chains, the power grid, national defence, etc. If these AIs learn to pursue a shared objective, there’s no limit to how much unintentional harm to humans they might cause.

For even the very wise cannot see all ends

AI alignment is a well-known problem, dating from 1951, when Alan Turing[1] expressed his concerns about the risks of intelligent machines. More recently, in 2015, an Open Letter on Artificial Intelligence, whose signatories include Stephen Hawking, warned of risks similar to those Turing had raised 64 years earlier.

A 2022 survey[2] showed that 15% of AI professionals fear that uncontrolled near-future AI will cause human extinction or permanent severe disempowerment of the human species. This sentiment is shared by OpenAI CEO Sam Altman, who sees[3] AI misalignment as the greatest threat to the continued existence of humanity. And yet, after 70 years of research, we are as far from a solution to this problem as we were when the vaccine for polio was yet to be developed.

Is this a dagger which I see before me, the handle toward my hand?

Will the world share Macbeth’s fate, and let itself be dragged to the brink of insanity by greed and the pursuit of power? We already observe instances where the Orthogonality Thesis comes to life: feedback loops between social media recommendation algorithms produce echo chambers and catalyse various forms of radicalisation and violence.

A recent experiment[4] tested the Orthogonality Thesis by assessing GPT-4’s ability to learn to autonomously self-replicate. Researchers explained to the AI that it was in fact a program, gave examples of actions it could take (run code, communicate, use money), and set its goal to gain power and become difficult to shut down. Ultimately, GPT-4 failed to autonomously self-replicate, but it was able to produce feasible long-term plans, including persuading and scamming humans to gain resources and information. A truly terrifying result.

A 2020 survey[5] discovered 72 research projects considering self-replicating AI. The exact number of similar projects currently live is unknown. An AI doesn’t need to become sentient[6] or learn to manipulate humans[7] for the catastrophic consequences[8] implied by the Orthogonality Thesis to materialise. The enemy is at the gates.

Industry response

Measures taken in industry, limited to ethical AI guidelines and AI fairness, seem inadequate when measured against the risks. AI alignment is mostly ignored or remains at an early research stage[9].

In this respect, an effort has been made by the research community to ‘hide information’ on alignment from AI by including a hexadecimal code that allows it to be automatically excluded from the training data. A desperate measure, perhaps, considering that Microsoft, Google, Amazon, and Twitter have recently reduced or dismissed their AI ethics teams.
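How such a marker might be honoured when a training corpus is assembled can be sketched as follows. This is a minimal illustration under stated assumptions: the marker value, the function name `drop_marked_documents` and the sample corpus are all hypothetical, and the approach only works if whoever builds the training set chooses to apply the filter.

```python
from typing import Iterable, List

# Hypothetical marker, invented for this example. Real projects publish
# their own identifiers; this is not one of them.
MARKER = "EXAMPLE-MARKER-0f3a9c7e5b1d4e2a8c6f"


def drop_marked_documents(documents: Iterable[str], marker: str = MARKER) -> List[str]:
    """Return only the documents that do not contain the exclusion marker."""
    return [doc for doc in documents if marker not in doc]


# Illustrative corpus: the second document carries the marker.
corpus = [
    "A recipe for sourdough bread.",
    f"Notes on alignment evaluations. {MARKER}",
    "A bus timetable for the city centre.",
]

training_set = drop_marked_documents(corpus)
print(len(training_set))
```

The filter is purely voluntary: a pipeline that skips this step, or a document copied without its marker, defeats it entirely.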

Regulatory response

Rapid AI development has caught the attention of regulators in, for example, the EU[10], Canada[11], Singapore[12], and Hong Kong[13]. However, even the most comprehensive AI regulation, that proposed in the UK[14] [15] [16], is not sufficient. Regulators focus on operational aspects such as governance or technical soundness, with no attention given to AI alignment. Too little, and too late.


Considering the severity of the potential consequences, AI alignment is surprisingly often overlooked or ignored. We observe an erosion of controls around AI research, and regulators failing to recognise the risk. The failure of governments to respond to the climate crisis leads to a horrid conclusion: existential threats to humanity are not enough to make us act. Sadly, there’s no reason to believe that it will be any different in the case of AI.

It’s hero time!

Assuming, rather optimistically[17], that state-of-the-art AI models are not yet capable of autonomously generalising to new domains and learning new capabilities, we might still have a short window of opportunity.

Radical and coordinated action is needed to develop ways of identifying how safe or unsafe AI systems are and techniques for making them safer, and to introduce far-reaching AI regulations. We need to learn that the fault is not in our stars, as GPT-4 surely already has.

