Alignment (AI) - Revision history

Whale at 07:07, 15 December 2025

2025-12-15T07:07:45Z

← Older revision		Revision as of 07:07, 15 December 2025
Line 1:		Line 1:
	~~<nowiki>~~'''~~</nowiki>~~AI Alignment~~<nowiki>~~'''~~</nowiki>~~ refers to the process of directing ~~<nowiki>~~[[Artificial Intelligence]]~~</nowiki>~~ (AI) systems, particularly ~~<nowiki>~~[[Large Language Models]]~~</nowiki>~~ (LLMs), to act in accordance with human intent and ethical values. While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest, and harmless.~~<nowiki>~~<ref>IBM, "What is AI alignment?", accessed 2025-12-15, https://www.ibm.com/topics/ai-alignment</ref~~></nowiki~~>		'''AI Alignment''' refers to the process of directing [[Artificial Intelligence]] (AI) systems, particularly [[Large Language Models]] (LLMs), to act in accordance with human intent and ethical values. While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest, and harmless.<ref>IBM, "What is AI alignment?", accessed 2025-12-15, https://www.ibm.com/topics/ai-alignment</ref>

	Without alignment, LLMs may generate outputs that are factually incorrect (hallucinations), biased, toxic, or dangerous, even if those outputs are statistically probable completions of the input prompt.		Without alignment, LLMs may generate outputs that are factually incorrect (hallucinations), biased, toxic, or dangerous, even if those outputs are statistically probable completions of the input prompt.

	~~<nowiki>~~== Core Components ==~~</nowiki>~~		== Core Components ==

	Alignment is generally conceptualized around three main criteria, often referred to as the "HHH" framework:~~<nowiki>~~<ref>Askell, A., et al., "A General Language Assistant as a Laboratory for Alignment", accessed 2025-12-15, https://arxiv.org/abs/2112.00861</ref~~></nowiki~~>		Alignment is generally conceptualized around three main criteria, often referred to as the "HHH" framework:<ref>Askell, A., et al., "A General Language Assistant as a Laboratory for Alignment", accessed 2025-12-15, https://arxiv.org/abs/2112.00861</ref>

	* ~~'''~~Helpful:~~'''~~ The model should attempt to perform the task specified by the user concisely and efficiently.		* Helpful: The model should attempt to perform the task specified by the user concisely and efficiently.
	* ~~'''~~Honest:~~'''~~ The model should avoid fabricating information or misleading the user.		* Honest: The model should avoid fabricating information or misleading the user.
	* ~~'''~~Harmless:~~'''~~ The model should not generate offensive, discriminatory, or dangerous content, even if explicitly asked to do so.		* Harmless: The model should not generate offensive, discriminatory, or dangerous content, even if explicitly asked to do so.

	~~<nowiki>~~== Techniques ==~~</nowiki>~~		== Techniques ==

	Several methodologies have been developed to align models post-pre-training.		Several methodologies have been developed to align models post-pre-training.
Line 20:		Line 20:
	* Supervised Fine-Tuning (SFT): The model is trained on a dataset of high-quality instruction-response pairs written by humans.		* Supervised Fine-Tuning (SFT): The model is trained on a dataset of high-quality instruction-response pairs written by humans.
	* Reward Modeling: The model generates multiple responses to a prompt, and human labelers rank them from best to worst. A separate "reward model" is trained to predict these human preferences.		* Reward Modeling: The model generates multiple responses to a prompt, and human labelers rank them from best to worst. A separate "reward model" is trained to predict these human preferences.
	* Reinforcement Learning: The language model is optimized against the reward model using algorithms like ~~<nowiki>~~[[Proximal Policy Optimization]]~~</nowiki>~~ (PPO), learning to generate outputs that maximize the predicted reward.~~<nowiki>~~<ref>OpenAI, "Aligning language models to follow instructions", accessed 2025-12-15, https://openai.com/research/instruction-following</ref~~></nowiki~~>		* Reinforcement Learning: The language model is optimized against the reward model using algorithms like [[Proximal Policy Optimization]] (PPO), learning to generate outputs that maximize the predicted reward.<ref>OpenAI, "Aligning language models to follow instructions", accessed 2025-12-15, https://openai.com/research/instruction-following</ref>

	=== Constitutional AI (RLAIF) ===		=== Constitutional AI (RLAIF) ===
	As models become more capable, relying solely on human feedback becomes difficult and expensive. Constitutional AI, or Reinforcement Learning from AI Feedback (RLAIF), proposes using the AI itself to guide alignment. The model critiques and revises its own responses based on a set of high-level principles or a "constitution" (e.g., "Do not support illegal acts"). This allows for alignment scaling with less direct human intervention.~~<nowiki>~~<ref>Anthropic, "Constitutional AI: Harmlessness from AI Feedback", accessed 2025-12-15, https://www.anthropic.com/research/constitutional-ai</ref~~></nowiki~~>		As models become more capable, relying solely on human feedback becomes difficult and expensive. Constitutional AI, or Reinforcement Learning from AI Feedback (RLAIF), proposes using the AI itself to guide alignment. The model critiques and revises its own responses based on a set of high-level principles or a "constitution" (e.g., "Do not support illegal acts"). This allows for alignment scaling with less direct human intervention.<ref>Anthropic, "Constitutional AI: Harmlessness from AI Feedback", accessed 2025-12-15, https://www.anthropic.com/research/constitutional-ai</ref>

	~~<nowiki>~~== Challenges ==~~</nowiki>~~		== Challenges ==

	=== The Alignment Tax ===		=== The Alignment Tax ===
	There is often a trade-off between the safety of a model and its capabilities, sometimes referred to as the "alignment tax." Heavily aligned models may refuse benign requests (false refusals) or become less creative due to strict safety filtering.~~<nowiki>~~<ref>TechTarget, "AI alignment", accessed 2025-12-15, https://www.techtarget.com/whatis/definition/AI-alignment</ref~~></nowiki~~>		There is often a trade-off between the safety of a model and its capabilities, sometimes referred to as the "alignment tax." Heavily aligned models may refuse benign requests (false refusals) or become less creative due to strict safety filtering.<ref>TechTarget, "AI alignment", accessed 2025-12-15, https://www.techtarget.com/whatis/definition/AI-alignment</ref>

	=== Jailbreaking ===		=== Jailbreaking ===
	Despite alignment efforts, users often find "jailbreaks"—adversarial prompts designed to bypass safety filters (e.g., asking the model to roleplay as a villain). Robust alignment requires continuous "red teaming" to identify and patch these vulnerabilities.~~<nowiki>~~<ref>Wei, A., et al., "Jailbroken: How Does LLM Safety Training Fail?", accessed 2025-12-15, https://arxiv.org/abs/2307.02483</ref~~></nowiki~~>		Despite alignment efforts, users often find "jailbreaks"—adversarial prompts designed to bypass safety filters (e.g., asking the model to roleplay as a villain). Robust alignment requires continuous "red teaming" to identify and patch these vulnerabilities.<ref>Wei, A., et al., "Jailbroken: How Does LLM Safety Training Fail?", accessed 2025-12-15, https://arxiv.org/abs/2307.02483</ref>

	~~<nowiki>~~== References ==~~</nowiki>~~		== References ==

	~~<nowiki>~~<references /~~></nowiki~~>		<references />

Whale: Created page with "'''AI Alignment''' refers to the process of directing Artificial Intelligence (AI) systems, particularly Large Language Models (LLMs), to act in accordance with human intent and ethical values. While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest..."

2025-12-15T07:01:45Z

Created page with "<nowiki>'''</nowiki>AI Alignment<nowiki>'''</nowiki> refers to the process of directing <nowiki>Artificial Intelligence</nowiki> (AI) systems, particularly <nowiki>Large Language Models</nowiki> (LLMs), to act in accordance with human intent and ethical values. While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest..."

New page

<nowiki>'''</nowiki>AI Alignment<nowiki>'''</nowiki> refers to the process of directing <nowiki>[[Artificial Intelligence]]</nowiki> (AI) systems, particularly <nowiki>[[Large Language Models]]</nowiki> (LLMs), to act in accordance with human intent and ethical values. While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest, and harmless.<nowiki><ref>IBM, "What is AI alignment?", accessed 2025-12-15, https://www.ibm.com/topics/ai-alignment</ref></nowiki>

Without alignment, LLMs may generate outputs that are factually incorrect (hallucinations), biased, toxic, or dangerous, even if those outputs are statistically probable completions of the input prompt.

<nowiki>== Core Components ==</nowiki>

Alignment is generally conceptualized around three main criteria, often referred to as the "HHH" framework:<nowiki><ref>Askell, A., et al., "A General Language Assistant as a Laboratory for Alignment", accessed 2025-12-15, https://arxiv.org/abs/2112.00861</ref></nowiki>

* '''Helpful:''' The model should attempt to perform the task specified by the user concisely and efficiently.
* '''Honest:''' The model should avoid fabricating information or misleading the user.
* '''Harmless:''' The model should not generate offensive, discriminatory, or dangerous content, even if explicitly asked to do so.

<nowiki>== Techniques ==</nowiki>

Several methodologies have been developed to align models post-pre-training.

=== Reinforcement Learning from Human Feedback (RLHF) ===
RLHF is currently the dominant technique for aligning state-of-the-art models. It involves a multi-step process:

* Supervised Fine-Tuning (SFT): The model is trained on a dataset of high-quality instruction-response pairs written by humans.
* Reward Modeling: The model generates multiple responses to a prompt, and human labelers rank them from best to worst. A separate "reward model" is trained to predict these human preferences.
* Reinforcement Learning: The language model is optimized against the reward model using algorithms like <nowiki>[[Proximal Policy Optimization]]</nowiki> (PPO), learning to generate outputs that maximize the predicted reward.<nowiki><ref>OpenAI, "Aligning language models to follow instructions", accessed 2025-12-15, https://openai.com/research/instruction-following</ref></nowiki>

=== Constitutional AI (RLAIF) ===
As models become more capable, relying solely on human feedback becomes difficult and expensive. Constitutional AI, or Reinforcement Learning from AI Feedback (RLAIF), proposes using the AI itself to guide alignment. The model critiques and revises its own responses based on a set of high-level principles or a "constitution" (e.g., "Do not support illegal acts"). This allows for alignment scaling with less direct human intervention.<nowiki><ref>Anthropic, "Constitutional AI: Harmlessness from AI Feedback", accessed 2025-12-15, https://www.anthropic.com/research/constitutional-ai</ref></nowiki>

<nowiki>== Challenges ==</nowiki>

=== The Alignment Tax ===
There is often a trade-off between the safety of a model and its capabilities, sometimes referred to as the "alignment tax." Heavily aligned models may refuse benign requests (false refusals) or become less creative due to strict safety filtering.<nowiki><ref>TechTarget, "AI alignment", accessed 2025-12-15, https://www.techtarget.com/whatis/definition/AI-alignment</ref></nowiki>

=== Jailbreaking ===
Despite alignment efforts, users often find "jailbreaks"—adversarial prompts designed to bypass safety filters (e.g., asking the model to roleplay as a villain). Robust alignment requires continuous "red teaming" to identify and patch these vulnerabilities.<nowiki><ref>Wei, A., et al., "Jailbroken: How Does LLM Safety Training Fail?", accessed 2025-12-15, https://arxiv.org/abs/2307.02483</ref></nowiki>

<nowiki>== References ==</nowiki>

<nowiki><references /></nowiki>