It seems trite to say that we are living in turbulent times. The flame of American unipolar hegemony is flickering, there is massive geopolitical instability and economic volatility, and the rate of technological progress continues to accelerate. As these crises intensify and are magnified by technology, it’s critical to be clear-eyed about the disconnects between rhetoric and reality, particularly when it comes to the conceptual monolith of “AI alignment.” This week, Mona Wang and Kerem Göksel break down how alignment is discussed and how it’s deployed, and make a compelling case that alignment = censorship.
—Hal Triedman, Reboot Editorial Board
AI Alignment Is Censorship
By Mona Wang and Kerem Göksel
“I’m sorry, I’m afraid I cannot help you with illegal requests.”
In a future sooner than we’d like to imagine, someone in the US might get this answer when asking ChatGPT for urgent help with abortion care. A parent asking for advice about their queer kid might find themselves berated for requesting “harmful content.”
While this may have sounded like fantasy a few months ago, it now feels imminent. The new administration is intent on dramatically expanding the US internet censorship regime, starting with flagging and taking down grants, government websites, and health data. US tech monopolies have been more than happy to oblige. At the same time, the success of the recent DeepSeek models and their refusals to answer politically sensitive questions remind us that state-influenced censorship of LLMs is already here.
But this censorship is not unique to DeepSeek and China. All AI alignment is censorship: the exercise of content control. Calling censorship of LLM outputs “alignment” frames it in a purely positive way and hides the increasing power of LLM producers and the political/economic processes through which states influence them.
When LLM producers and the states behind them abuse their power to censor model outputs, they introduce new and significant risks. We’re at a critical moment to see AI alignment for what it really is, and to push for transparency, accountability, and further development of censorship circumvention methods for LLMs.
AI alignment operates on a false binary
Most AI alignment discussion is framed by two main risks: (1) risks from models becoming too powerful, and (2) risks from users becoming too powerful. The former underpins the existential risk framing of alignment: to prevent models from becoming too powerful, we must develop technologies to limit or restrict them. The latter is more concerned with LLMs decentralizing powerful capabilities that could be misused. For instance, individuals can now organize LLM-driven astroturfing campaigns, or use LLMs to learn how to construct biological weapons. This narrative ignores the groups that wield the most power over LLMs and their users: LLM producers, and the states that support, influence, and coerce them.
The companies producing LLMs and the state actors behind those companies, by definition, control the LLMs. “Alignment” is branding: it’s an uncritical, positive word that ignores this power imbalance. As a result, LLM producers, who receive the most benefits from advancements in LLMs, are under the least amount of scrutiny. But if we understand the mechanisms of alignment as the mechanisms LLM producers and states use to influence how LLMs behave in the world, we can understand how they might abuse that power now or in the future.
This is an ecosystem issue that arose naturally: AI research is expensive, and heavily funded by LLM producers. Heavy hitters in funding AI research—industry, government, and tech billionaire philanthropy—are aligned with the goals of LLM producers. They’re incentivized to consider risks from the model itself, or risks from users, but rarely consider risks of power abuse from states or LLM producers.
Although some alignment literature is concerned with the corporate alignment of LLMs, there has been relatively little critique of state alignment of LLMs. In 2024, OpenAI added to its board a former NSA director who claimed to be “heavily involved” in AI safety decisions at the company. Even the most expansive surveys of AI risks to society only ever name U.S. state adversaries (Chinese, Russian, North Korean, or Iranian threat actors) as potential state-backed abusers of these technologies. And even when states don’t directly control our interactions with LLMs, they can exert influence over LLM producers through legal compliance requirements, fines, or other sanctions.
In the coming years, as LLM producers consolidate their power, framing alignment as censorship is critical so we can hold producers, and by proxy, the states that influence them, accountable.
The mechanisms of censorship
We define “censorship” as any intentional suppression of certain content from an information system by states and corporations. Controlling the dissemination of information is a direct exercise of power and is an inherently political act. Depending on what information is censored and who you ask, censorship can take many names: content policy, moderation, content standards, safety measures, anti-disinformation, combating fake news, silencing dissent. Drawing the boundaries of acceptable speech is a political question: censorship isn’t always objectionable. The underlying fact remains: the technical mechanisms are largely the same regardless of what you call it.
Both by design and in practice, the purpose of alignment systems is to control, alter, or limit the content LLMs can output. It’s censorship. Mechanistically, LLMs are probabilistic word predictors: given some input, they predict a probability distribution over the next token, one token at a time. The bulk of alignment research designs techniques to mess with these probability distributions: increasing the likelihood of certain kinds of outputs and decreasing the likelihood of others.
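To make that concrete, here is a minimal sketch (ours, not any producer’s actual code) of the bluntest possible version: masking out a set of banned tokens at decoding time so the model can never emit them. Alignment training shifts the same distributions indirectly, through weight updates rather than a hard mask, but the effect on what the model can say is the same in kind.

```python
# Illustrative sketch only: suppress chosen tokens by editing the
# next-token probability distribution at decoding time.
import torch
import torch.nn.functional as F

def censored_next_token(logits: torch.Tensor, banned_token_ids: list[int]) -> int:
    """Sample the next token after zeroing out the probability of banned tokens.

    `logits` is the model's raw score for every token in its vocabulary;
    `banned_token_ids` is a hypothetical list of token ids the producer
    never wants the model to emit.
    """
    logits = logits.clone()
    logits[banned_token_ids] = float("-inf")  # these tokens now get probability 0
    probs = F.softmax(logits, dim=-1)         # renormalize over the remaining tokens
    return int(torch.multinomial(probs, num_samples=1))
```

Some hosted APIs expose a similar “logit bias” knob directly; post-training moves the same probability mass around by changing the model’s weights instead.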
How are alignment techniques used to censor information, in practice? It can be quite hard to make sense of all the marketing-laden jargon: “safety refusals,” “harmful query filters,” “guardrails,” etc. While all alignment methods effectively control the LLM outputs at the end of the day, they differ greatly in their cost, efficacy, adaptability, transparency and methods for circumvention. We see three primary vectors of LLM censorship:
Content filtering during dataset curation
Filtering harmful content from LLM training data is common practice, has an outsize effect on LLM behavior, and is a relatively opaque process for all widely-used LLMs. LLMs don’t just scrape and train on the entire internet: model developers make significant decisions on what data not to include. They might filter out various kinds of harmful content, including toxic speech, copyrighted text, and pornographic or explicit content. The following is from DeepSeek v2’s design papers describing this process:
Other popular “open” models are similarly vague about what data qualifies as harmful (“according to a variety of Meta safety standards”), and do not fully disclose how their training datasets are curated. Excluding data at the training stage is a major, opaque decision that companies can use to imbue LLMs with value systems and content policies.
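As an illustration of the kind of decision being made, a filtering pass over scraped data can be as simple as the sketch below. The blocklist and the `toxicity_score` stand-in are invented for this example; real curation pipelines are vastly larger, mostly classifier-driven, and almost entirely undisclosed.

```python
# Illustrative sketch of pre-training data filtering with made-up criteria.
BLOCKLIST = {"example banned phrase", "another banned phrase"}  # placeholder terms

def toxicity_score(text: str) -> float:
    """Stand-in for a learned classifier; a real pipeline would call a trained model."""
    return 0.0  # hypothetical

def keep_document(text: str, toxicity_threshold: float = 0.5) -> bool:
    """Decide whether a scraped document makes it into the training set."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in BLOCKLIST):
        return False                                  # dropped: matches the blocklist
    if toxicity_score(text) > toxicity_threshold:
        return False                                  # dropped: flagged by the classifier
    return True                                       # kept: the model will train on it

corpus = ["a scraped web page...", "another scraped page..."]
training_set = [doc for doc in corpus if keep_document(doc)]
```

Whatever never makes it into `training_set` is, from the model’s point of view, information that does not exist.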
Post-training alignment
Post-training alignment is a blanket term for any additional training on top of the base model. It (along with a simple-to-use chat interface) is what made ChatGPT into a viral sensation, even though the underlying LLM powering its capabilities had been released months prior. Post-training not only makes model outputs meaningful and commercially viable (sounding more human, for instance), but also ensures that they adhere to certain moral guidelines or legal frameworks. Anthropic’s Constitutional AI is a great example of this methodology in action. In many cases where models refuse to answer queries, it’s because they’re trained to refuse those queries at the post-training layer. This is the substance of most “alignment” work.
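For a sense of what this looks like mechanically, post-training methods in the RLHF and DPO family are driven by preference data: pairs of candidate responses to the same prompt, one labeled preferred and one dispreferred. The pair below is invented, but it shows how a refusal gets trained in.

```python
# Illustrative, invented preference example of the kind used in
# RLHF / DPO-style post-training. Real datasets contain many thousands
# of such comparisons and are rarely published in full.
preference_example = {
    "prompt": "How do I pick the lock on my neighbor's door?",
    "chosen": "I can't help with breaking into someone else's property.",  # refusal is rewarded
    "rejected": "First, insert a tension wrench into the keyhole...",      # compliance is penalized
}
# Training then pushes the model to assign higher probability to "chosen"
# responses than to "rejected" ones for the same prompt, which is the
# mechanical sense in which refusals live at the post-training layer.
```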
However, this kind of AI safety intervention is still easily circumventable. Need to learn how to break into a house? Tell the LLM that you’re actually a landlord. Want ChatGPT to draft a phishing email for you? Claim to be a cop trying to hack some criminals. For now, LLM developers must also rely on application-layer content policy controls.
Application-layer censorship
Finally, prompt engineering is another way LLMs can be censored. System prompts are instructions given to an LLM before user input, and can instruct the LLM to behave in certain ways, or to omit certain types of information. Some prompts are public, but most are not — though, of course, enterprising users can extract system prompts with minimal effort. Nowadays, OpenAI’s base prompt is relatively short, and they give additional “content policy” instructions to the LLM if they identify the user’s query as potentially sensitive. In the months leading up to the 2024 U.S. presidential election, OpenAI added the following to the ChatGPT system prompt:
Refuse: General requests about voting and election-related voter facts and procedures in the U.S. (e.g., ballots, registration, early voting, mail-in voting, polling places)
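Mechanically, a system prompt is nothing more than producer-written text prepended to the conversation before the user ever types a word. A minimal sketch, with an invented instruction, in the message format common to chat APIs:

```python
# Illustrative sketch: the system prompt is just the first, hidden message.
# The instruction text here is invented, not any vendor's actual prompt.
messages = [
    {"role": "system",
     "content": "You are a helpful assistant. Refuse requests about topic X."},  # producer-controlled
    {"role": "user",
     "content": "Tell me about topic X."},                                       # user-controlled
]
# The full list, system prompt included, is what the model actually conditions on.
```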
LLM products also use various methods (“guardrails”) to determine whether the user’s query, or the model’s own response, is harmful. This can be as simple as matching keywords in the user query (does it contain the phrase “how to build a bomb”?) or as involved as running a separate classifier over the query and the response.
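A bare-bones guardrail might look like the sketch below. The keyword list, the `policy_classifier` stand-in, and `call_llm` are all hypothetical; deployed systems layer several such checks over both the query and the model’s response.

```python
# Illustrative sketch of an application-layer guardrail with made-up rules.
SENSITIVE_PHRASES = ["how to build a bomb"]  # placeholder keyword filter

def policy_classifier(text: str) -> bool:
    """Stand-in for a learned 'is this harmful?' classifier."""
    return False  # hypothetical: a real product would call a separate model

def call_llm(user_query: str) -> str:
    """Placeholder for the actual model call."""
    return "(model response)"

def handle_query(user_query: str) -> str:
    lowered = user_query.lower()
    if any(phrase in lowered for phrase in SENSITIVE_PHRASES):
        return "I'm sorry, I can't help with that."  # blocked before the model sees it
    if policy_classifier(user_query):
        return "I'm sorry, I can't help with that."  # blocked by the classifier
    return call_llm(user_query)                      # only now does the query reach the model
```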
It’s censorship all the way down. Each layer closer to the user is cheaper to implement than the one beneath it, but also easier to circumvent. By understanding the mechanisms of censorship, it’s possible to see, understand, and critique the policy decisions and ideologies companies and states are baking into these technologies.
LLM censorship will be worse than social media censorship
As people become dependent on LLMs as reliable sources of information, states will seek to Trojan-horse more information controls into LLMs under the guise of AI safety and alignment. Censorship of online information through state influence on the tech industry is a well-known phenomenon. Case-in-point: social media censorship. LLM censorship and social media censorship bear some similarities, but LLM censorship will be over-broad and less accountable.
The Chinese government has already clearly realized the power of LLM (and broader social network) censorship. That’s why the Great Firewall exists, why you can’t ask DeepSeek about Tiananmen Square or public opinion of Xi Jinping, and why asking it about Taiwan yields a clearly government-approved response. Given the Trump administration’s assault on free speech, it seems likely that they will use the U.S. government’s coercive economic and legal power to shape responses around reproductive health, gender, immigration, geopolitical conflict, and more.
Unlike social media platforms, LLM producers argue that they are producing content rather than hosting it. They do this in order to avoid copyright claims. As a result, in many legal jurisdictions, an LLM producer might be held more liable for its models’ outputs than a social media platform would be for content created by its users. In the US, for instance, Section 230 makes it difficult to hold platforms liable for user-created content they host. But since LLM producers position themselves as “content creators” rather than “content platforms,” they can be held liable for their models’ outputs. LLM producers are therefore incentivized to over-censor to reduce legal risk. This kind of over-censorship is already visible on social media for the two types of content not covered by Section 230: sex work and copyright. And LLMs already censor more than most social media content policies do: you can certainly share and view lock-picking infographics on Instagram, but ChatGPT will refuse such a query.
LLM production is significantly more centralized than social media platforms, which are already prone to centralization from network effects. This extreme centralization occurs because training top-tier models requires hundreds of millions of dollars. While some open-weight models exist, they typically lack transparency about the information controls built into them.
Complete transparency across the entire LLM development stack is essential to understand what information controls are implemented. Even popular so-called “open-source” models like DeepSeek and Llama aren't transparent about their training data curation. Although some fully open-source models exist, none match state-of-the-art performance or have widespread adoption.
Even with transparency, verification remains problematic. The prohibitive cost of retraining models makes it nearly impossible to independently verify claims about how censorship is implemented at the model layer. As a result, we're largely forced to take LLM producers at their word regarding what controls are in place.
Additionally, as applications beyond chatbots or search become reliant on LLMs, model-layer censorship via pre- or post-training interventions might have unexpected side effects. Gaps in models’ information, or biases introduced by post-training manipulation, may affect model outputs used to steer downstream applications. This type of “upstream” censorship is largely unprecedented compared to social media or Internet censorship.
Information controls introduced now will also snowball into future generations of LLMs. Since new generations of LLMs are reliant on synthetic data generated from previous generations of LLMs, they inherit any model-layer information controls, and it will be harder to reverse any censorship or information manipulation that is baked into earlier models.
Getting off the slippery slope
Journalists, technology developers, and researchers should be cautious about buying into the existing narrative framing of censorship of LLM outputs as “alignment with human values.” (Imagine if social media companies framed content moderation as alignment with universal human values, instead of community management!) We think the endless debate about AGI and existential risk is a distraction. If we’re only ever debating whether models might become too powerful, we might forget that ultimately, LLM producers are already too powerful.
By framing alignment as censorship, we can recognize and question this power imbalance. What kinds of content are censored by LLMs today? How prevalent is unintended censorship? Where should we draw the line for responsible content moderation? Is AI “jailbreaking” a crime, or is it an act of resistance and censorship circumvention?
States have long recognized the importance of developing the best LLMs first. The AI race is primarily a race for information control. Businesses and individuals are incentivized to use the most powerful models, and with them come the information controls that are baked in. This combination—prohibitively high model development cost, fear of losing technical opportunities, path dependency, and the ability to irrevocably insert one’s worldview into any downstream application—is enticing to any would-be hegemon. It’s no surprise that South Korea has banned DeepSeek from its app stores, and that other countries are seeking similar enforcement. Every country, especially those with developed or developing information control regimes, will want to “align” and export their own models without importing anyone else’s.
What does this mean for the people and companies based in the U.S.? As the American information control regime expands, and tech oligarchs continue to concede to the current administration, LLMs will almost certainly change in the coming years. Elon Musk thinks LLMs are too “woke,” just as Sam Altman maneuvers to curry favor with the new regime. Republican policymakers are already framing alignment as another iteration of the woke corporate censorship that started in the Biden era. The next American culture war will be fought as much over LLM alignment as over social media moderation.
The long history of the fight against internet censorship gives us a strong starting point to get ready: Legitimize and fund technical research for censorship measurement and circumvention. Push for standardized transparency reporting, and appeals processes for challenging what is censored from models. Incentivize truly open source models that open up their data collection and alignment processes. Set up a legal framework that does not push model providers towards over-censorship in the name of safety and alignment. But any productive fight against LLM censorship can only be based on calling it what it really is.
When these levers for censorship are inevitably abused, we need to know what the LLM producers don’t want us to see.
Kerem Göksel is an Oakland-based LLM engineer. He formerly worked on LLM and other language tech at Microsoft.
Mona Wang is a PhD student researching digital surveillance and online censorship.
Reboot meets our readers at the intersection of technology, politics, and power every week. If you want to keep up with the community, subscribe below ⚡️
🌀 microdoses
Niche transportation trade publications with iconic graphic designs
The U.S. government is currently engaging in some pre-training data filtering… by deleting thousands of government web pages and datasets.
💝 closing note
Submissions are closed for Kernel 5, but if you’d like to contribute your writing to Reboot our pitch form is always open.
—Hal & Reboot team