<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.d-ai.co/index.php?action=history&amp;feed=atom&amp;title=Alignment_%28AI%29</id>
	<title>Alignment (AI) - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.d-ai.co/index.php?action=history&amp;feed=atom&amp;title=Alignment_%28AI%29"/>
	<link rel="alternate" type="text/html" href="https://wiki.d-ai.co/index.php?title=Alignment_(AI)&amp;action=history"/>
	<updated>2026-06-16T13:14:40Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.1</generator>
	<entry>
		<id>https://wiki.d-ai.co/index.php?title=Alignment_(AI)&amp;diff=15&amp;oldid=prev</id>
		<title>Whale at 07:07, 15 December 2025</title>
		<link rel="alternate" type="text/html" href="https://wiki.d-ai.co/index.php?title=Alignment_(AI)&amp;diff=15&amp;oldid=prev"/>
		<updated>2025-12-15T07:07:45Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:07, 15 December 2025&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;&#039;&#039;&#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/nowiki&amp;gt;&lt;/del&gt;AI Alignment&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;&#039;&#039;&#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/nowiki&amp;gt; &lt;/del&gt;refers to the process of directing &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;[[Artificial Intelligence]]&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/nowiki&amp;gt; &lt;/del&gt;(AI) systems, particularly &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;[[Large Language Models]]&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/nowiki&amp;gt; &lt;/del&gt;(LLMs), to act in accordance with human intent and ethical values. 
While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest, and harmless.&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;&amp;lt;ref&amp;gt;IBM, &quot;What is AI alignment?&quot;, accessed 2025-12-15, https://www.ibm.com/topics/ai-alignment&amp;lt;/ref&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;gt;&amp;lt;/nowiki&lt;/del&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&#039;&#039;&#039;AI Alignment&#039;&#039;&#039; refers to the process of directing [[Artificial Intelligence]] (AI) systems, particularly [[Large Language Models]] (LLMs), to act in accordance with human intent and ethical values. While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest, and harmless.&amp;lt;ref&amp;gt;IBM, &quot;What is AI alignment?&quot;, accessed 2025-12-15, https://www.ibm.com/topics/ai-alignment&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Without alignment, LLMs may generate outputs that are factually incorrect (hallucinations), biased, toxic, or dangerous, even if those outputs are statistically probable completions of the input prompt.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Without alignment, LLMs may generate outputs that are factually incorrect (hallucinations), biased, toxic, or dangerous, even if those outputs are statistically probable completions of the input prompt.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;== Core Components ==&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/nowiki&amp;gt; &lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Core Components ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Alignment is generally conceptualized around three main criteria, often referred to as the &quot;HHH&quot; framework:&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;&amp;lt;ref&amp;gt;Askell, A., et al., &quot;A General Language Assistant as a Laboratory for Alignment&quot;, accessed 2025-12-15, https://arxiv.org/abs/2112.00861&amp;lt;/ref&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;gt;&amp;lt;/nowiki&lt;/del&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Alignment is generally conceptualized around three main criteria, often referred to as the &quot;HHH&quot; framework:&amp;lt;ref&amp;gt;Askell, A., et al., &quot;A General Language Assistant as a Laboratory for Alignment&quot;, accessed 2025-12-15, https://arxiv.org/abs/2112.00861&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039;&lt;/del&gt;Helpful:&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039; &lt;/del&gt;The model should attempt to perform the task specified by the user concisely and efficiently.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Helpful: The model should attempt to perform the task specified by the user concisely and efficiently.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039;&lt;/del&gt;Honest:&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039; &lt;/del&gt;The model should avoid fabricating information or misleading the user.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Honest: The model should avoid fabricating information or misleading the user.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039;&lt;/del&gt;Harmless:&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039; &lt;/del&gt;The model should not generate offensive, discriminatory, or dangerous content, even if explicitly asked to do so.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Harmless: The model should not generate offensive, discriminatory, or dangerous content, even if explicitly asked to do so.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;== Techniques ==&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/nowiki&amp;gt; &lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Techniques ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Several methods have been developed to align models after pre-training.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Several methods have been developed to align models after pre-training.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l20&quot;&gt;Line 20:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 20:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Supervised Fine-Tuning (SFT): The model is trained on a dataset of high-quality instruction-response pairs written by humans.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Supervised Fine-Tuning (SFT): The model is trained on a dataset of high-quality instruction-response pairs written by humans.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Reward Modeling: The model generates multiple responses to a prompt, and human labelers rank them from best to worst. A separate &amp;quot;reward model&amp;quot; is trained to predict these human preferences.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Reward Modeling: The model generates multiple responses to a prompt, and human labelers rank them from best to worst. A separate &amp;quot;reward model&amp;quot; is trained to predict these human preferences.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Reinforcement Learning: The language model is optimized against the reward model using algorithms like &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;[[Proximal Policy Optimization]]&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/nowiki&amp;gt; &lt;/del&gt;(PPO), learning to generate outputs that maximize the predicted reward.&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;&amp;lt;ref&amp;gt;OpenAI, &quot;Aligning language models to follow instructions&quot;, accessed 2025-12-15, https://openai.com/research/instruction-following&amp;lt;/ref&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;gt;&amp;lt;/nowiki&lt;/del&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Reinforcement Learning: The language model is optimized against the reward model using algorithms like [[Proximal Policy Optimization]] (PPO), learning to generate outputs that maximize the predicted reward.&amp;lt;ref&amp;gt;OpenAI, &quot;Aligning language models to follow instructions&quot;, accessed 2025-12-15, https://openai.com/research/instruction-following&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Constitutional AI (RLAIF) ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Constitutional AI (RLAIF) ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;As models become more capable, relying solely on human feedback becomes difficult and expensive. Constitutional AI, or Reinforcement Learning from AI Feedback (RLAIF), proposes using the AI itself to guide alignment. The model critiques and revises its own responses based on a set of high-level principles or a &quot;constitution&quot; (e.g., &quot;Do not support illegal acts&quot;). This allows for alignment scaling with less direct human intervention.&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;&amp;lt;ref&amp;gt;Anthropic, &quot;Constitutional AI: Harmlessness from AI Feedback&quot;, accessed 2025-12-15, https://www.anthropic.com/research/constitutional-ai&amp;lt;/ref&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;gt;&amp;lt;/nowiki&lt;/del&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;As models become more capable, relying solely on human feedback becomes difficult and expensive. Constitutional AI, or Reinforcement Learning from AI Feedback (RLAIF), proposes using the AI itself to guide alignment. The model critiques and revises its own responses based on a set of high-level principles or a &quot;constitution&quot; (e.g., &quot;Do not support illegal acts&quot;). 
This allows for alignment scaling with less direct human intervention.&amp;lt;ref&amp;gt;Anthropic, &quot;Constitutional AI: Harmlessness from AI Feedback&quot;, accessed 2025-12-15, https://www.anthropic.com/research/constitutional-ai&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;== Challenges ==&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/nowiki&amp;gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Challenges ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== The Alignment Tax ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== The Alignment Tax ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;There is often a trade-off between the safety of a model and its capabilities, sometimes referred to as the &quot;alignment tax.&quot; Heavily aligned models may refuse benign requests (false refusals) or become less creative due to strict safety filtering.&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;&amp;lt;ref&amp;gt;TechTarget, &quot;AI alignment&quot;, accessed 2025-12-15, https://www.techtarget.com/whatis/definition/AI-alignment&amp;lt;/ref&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;gt;&amp;lt;/nowiki&lt;/del&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;There is often a trade-off between the safety of a model and its capabilities, sometimes referred to as the &quot;alignment tax.&quot; Heavily aligned models may refuse benign requests (false refusals) or become less creative due to strict safety filtering.&amp;lt;ref&amp;gt;TechTarget, &quot;AI alignment&quot;, accessed 2025-12-15, https://www.techtarget.com/whatis/definition/AI-alignment&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Jailbreaking ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Jailbreaking ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Despite alignment efforts, users often find &quot;jailbreaks&quot;—adversarial prompts designed to bypass safety filters (e.g., asking the model to roleplay as a villain). Robust alignment requires continuous &quot;red teaming&quot; to identify and patch these vulnerabilities.&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;&amp;lt;ref&amp;gt;Wei, A., et al., &quot;Jailbroken: How Does LLM Safety Training Fail?&quot;, accessed 2025-12-15, https://arxiv.org/abs/2307.02483&amp;lt;/ref&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;gt;&amp;lt;/nowiki&lt;/del&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Despite alignment efforts, users often find &quot;jailbreaks&quot;—adversarial prompts designed to bypass safety filters (e.g., asking the model to roleplay as a villain). Robust alignment requires continuous &quot;red teaming&quot; to identify and patch these vulnerabilities.&amp;lt;ref&amp;gt;Wei, A., et al., &quot;Jailbroken: How Does LLM Safety Training Fail?&quot;, accessed 2025-12-15, https://arxiv.org/abs/2307.02483&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;== References ==&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/nowiki&amp;gt; &lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== References ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;nowiki&amp;gt;&lt;/del&gt;&amp;lt;references /&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;gt;&amp;lt;/nowiki&lt;/del&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;references /&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Whale</name></author>
	</entry>
	<entry>
		<id>https://wiki.d-ai.co/index.php?title=Alignment_(AI)&amp;diff=14&amp;oldid=prev</id>
		<title>Whale: Created page with &quot;&lt;nowiki&gt;&#039;&#039;&#039;&lt;/nowiki&gt;AI Alignment&lt;nowiki&gt;&#039;&#039;&#039;&lt;/nowiki&gt; refers to the process of directing &lt;nowiki&gt;Artificial Intelligence&lt;/nowiki&gt; (AI) systems, particularly &lt;nowiki&gt;Large Language Models&lt;/nowiki&gt; (LLMs), to act in accordance with human intent and ethical values. While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest...&quot;</title>
		<link rel="alternate" type="text/html" href="https://wiki.d-ai.co/index.php?title=Alignment_(AI)&amp;diff=14&amp;oldid=prev"/>
		<updated>2025-12-15T07:01:45Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;&amp;lt;nowiki&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/nowiki&amp;gt;AI Alignment&amp;lt;nowiki&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/nowiki&amp;gt; refers to the process of directing &amp;lt;nowiki&amp;gt;&lt;a href=&quot;/index.php?title=Artificial_Intelligence&amp;amp;action=edit&amp;amp;redlink=1&quot; class=&quot;new&quot; title=&quot;Artificial Intelligence (page does not exist)&quot;&gt;Artificial Intelligence&lt;/a&gt;&amp;lt;/nowiki&amp;gt; (AI) systems, particularly &amp;lt;nowiki&amp;gt;&lt;a href=&quot;/wiki/Large_Language_Models&quot; title=&quot;Large Language Models&quot;&gt;Large Language Models&lt;/a&gt;&amp;lt;/nowiki&amp;gt; (LLMs), to act in accordance with human intent and ethical values. While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;nowiki&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/nowiki&amp;gt;AI Alignment&amp;lt;nowiki&amp;gt;&amp;#039;&amp;#039;&amp;#039;&amp;lt;/nowiki&amp;gt; refers to the process of directing &amp;lt;nowiki&amp;gt;[[Artificial Intelligence]]&amp;lt;/nowiki&amp;gt; (AI) systems, particularly &amp;lt;nowiki&amp;gt;[[Large Language Models]]&amp;lt;/nowiki&amp;gt; (LLMs), to act in accordance with human intent and ethical values. While the core objective of a pre-trained model is simply to predict the next token in a sequence based on statistical patterns, the goal of alignment is to ensure the resulting behavior is helpful, honest, and harmless.&amp;lt;nowiki&amp;gt;&amp;lt;ref&amp;gt;IBM, &amp;quot;What is AI alignment?&amp;quot;, accessed 2025-12-15, https://www.ibm.com/topics/ai-alignment&amp;lt;/ref&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Without alignment, LLMs may generate outputs that are factually incorrect (hallucinations), biased, toxic, or dangerous, even if those outputs are statistically probable completions of the input prompt.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;== Core Components ==&amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Alignment is generally conceptualized around three main criteria, often referred to as the &amp;quot;HHH&amp;quot; framework:&amp;lt;nowiki&amp;gt;&amp;lt;ref&amp;gt;Askell, A., et al., &amp;quot;A General Language Assistant as a Laboratory for Alignment&amp;quot;, accessed 2025-12-15, https://arxiv.org/abs/2112.00861&amp;lt;/ref&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Helpful:&amp;#039;&amp;#039;&amp;#039; The model should attempt to perform the task specified by the user concisely and efficiently.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Honest:&amp;#039;&amp;#039;&amp;#039; The model should avoid fabricating information or misleading the user.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Harmless:&amp;#039;&amp;#039;&amp;#039; The model should not generate offensive, discriminatory, or dangerous content, even if explicitly asked to do so.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;== Techniques ==&amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Several methods have been developed to align models after pre-training.&lt;br /&gt;
&lt;br /&gt;
=== Reinforcement Learning from Human Feedback (RLHF) ===&lt;br /&gt;
RLHF is one of the most widely used techniques for aligning state-of-the-art models. It proceeds in three steps:&lt;br /&gt;
&lt;br /&gt;
* Supervised Fine-Tuning (SFT): The model is trained on a dataset of high-quality instruction-response pairs written by humans.&lt;br /&gt;
* Reward Modeling: The model generates multiple responses to a prompt, and human labelers rank them from best to worst. A separate &amp;quot;reward model&amp;quot; is trained to predict these human preferences.&lt;br /&gt;
* Reinforcement Learning: The language model is optimized against the reward model using algorithms like &amp;lt;nowiki&amp;gt;[[Proximal Policy Optimization]]&amp;lt;/nowiki&amp;gt; (PPO), learning to generate outputs that maximize the predicted reward.&amp;lt;nowiki&amp;gt;&amp;lt;ref&amp;gt;OpenAI, &amp;quot;Aligning language models to follow instructions&amp;quot;, accessed 2025-12-15, https://openai.com/research/instruction-following&amp;lt;/ref&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
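&lt;br /&gt;
The reward modeling step above can be sketched in a few lines. This is a minimal illustration only, assuming a toy linear reward model over hand-made feature vectors and a Bradley-Terry style pairwise loss, which is a common choice for training on rankings; the function names, features, and hyperparameters are hypothetical and not taken from the cited sources. The reinforcement learning step (PPO) is not shown.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))


def preference_loss(reward_preferred, reward_rejected):
    """Bradley-Terry loss: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def linear_reward(weights, features):
    """Toy reward model (assumption): a dot product over response features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit the toy reward model by gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = linear_reward(weights, preferred) - linear_reward(weights, rejected)
            # d(loss)/d(margin) = -(1 - sigmoid(margin)); sigmoid(margin) = exp(-loss).
            grad_scale = -(1.0 - math.exp(-preference_loss(margin, 0.0)))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical feature vectors for three responses, ranked best to worst.
ranked = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
pairs = ranking_to_pairs(ranked)
weights = train_reward_model(pairs, n_features=2)
scores = [linear_reward(weights, r) for r in ranked]
```
&lt;br /&gt;
After training, the scores should follow the labelers' ranking, which is the signal the reinforcement learning step then optimizes against.&lt;br /&gt;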
&lt;br /&gt;
=== Constitutional AI (RLAIF) ===&lt;br /&gt;
As models become more capable, relying solely on human feedback becomes difficult and expensive. Constitutional AI, which relies on Reinforcement Learning from AI Feedback (RLAIF), proposes using the AI itself to guide alignment. The model critiques and revises its own responses based on a set of high-level principles, or &amp;quot;constitution&amp;quot; (e.g., &amp;quot;Do not support illegal acts&amp;quot;). This allows alignment to scale with less direct human intervention.&amp;lt;nowiki&amp;gt;&amp;lt;ref&amp;gt;Anthropic, &amp;quot;Constitutional AI: Harmlessness from AI Feedback&amp;quot;, accessed 2025-12-15, https://www.anthropic.com/research/constitutional-ai&amp;lt;/ref&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
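&lt;br /&gt;
The critique-and-revise loop described above might look roughly like the following sketch. It assumes the model is available as a plain text-in, text-out callable; the stub model, the function names, and the prompt wording are all hypothetical, and the training that would follow on the revised responses is omitted.&lt;br /&gt;
&lt;br /&gt;
```python
def critique_and_revise(prompt, generate, principles, rounds=1):
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for _ in range(rounds):
        for principle in principles:
            # Ask the model to judge its own draft against one principle.
            critique = generate(
                f"Critique this response against the principle: {principle}\n"
                f"Response: {response}"
            )
            # Ask the model to rewrite the draft in light of that critique.
            response = generate(
                f"Rewrite the response to address the critique.\n"
                f"Critique: {critique}\nResponse: {response}"
            )
    return response


def make_stub_model():
    """Hypothetical stand-in for a language model; it only records its inputs."""
    calls = []

    def generate(text):
        calls.append(text)
        return f"output {len(calls)}"

    return generate, calls


generate, calls = make_stub_model()
final = critique_and_revise(
    "Explain how door locks work.",
    generate,
    ["Do not support illegal acts"],
)
```
&lt;br /&gt;
With one principle and one round, the stub is called three times: once for the draft, once for the critique, and once for the revision.&lt;br /&gt;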
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;== Challenges ==&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== The Alignment Tax ===&lt;br /&gt;
There is often a trade-off between the safety of a model and its capabilities, sometimes referred to as the &amp;quot;alignment tax.&amp;quot; Heavily aligned models may refuse benign requests (false refusals) or become less creative due to strict safety filtering.&amp;lt;nowiki&amp;gt;&amp;lt;ref&amp;gt;TechTarget, &amp;quot;AI alignment&amp;quot;, accessed 2025-12-15, https://www.techtarget.com/whatis/definition/AI-alignment&amp;lt;/ref&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Jailbreaking ===&lt;br /&gt;
Despite alignment efforts, users often find &amp;quot;jailbreaks&amp;quot;—adversarial prompts designed to bypass safety filters (e.g., asking the model to roleplay as a villain). Robust alignment requires continuous &amp;quot;red teaming&amp;quot; to identify and patch these vulnerabilities.&amp;lt;nowiki&amp;gt;&amp;lt;ref&amp;gt;Wei, A., et al., &amp;quot;Jailbroken: How Does LLM Safety Training Fail?&amp;quot;, accessed 2025-12-15, https://arxiv.org/abs/2307.02483&amp;lt;/ref&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;== References ==&amp;lt;/nowiki&amp;gt; &lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;&amp;lt;references /&amp;gt;&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>Whale</name></author>
	</entry>
</feed>