Fine-Tuning on Your Own Data: Why 'We Host It Ourselves' Solves the Wrong Half of the GDPR Problem
Self-hosting can reduce the third-country transfer exposure but leaves the harder problem untouched: once personal data has influenced the weights, deletion on request is currently difficult to verify and may demand retraining, unlearning, or output controls of uncertain reliability. Legitimate interest is not a default — it is a three-step test the EDPB sets a high bar for. Why the lawful basis must be settled before training, not after.
Published on June 9, 2026
CORE THESIS Self-hosting an AI model can remove much of the third-country transfer exposure, depending on where the infrastructure, support access, telemetry, and subprocessors actually sit. It leaves the harder problem in place. Training is not a transient use of personal data; it incorporates that data into the model’s weights in a way that is currently difficult, costly, and often unreliable to reverse, so deletion on request ranges from disproportionately expensive to, in practice, not yet dependably achievable. Legitimate interest can carry such processing, but only as the conclusion of a three-step test for which the EDPB sets a deliberately high bar, never as a default. The lawful basis, the erasure pathway, and the impact assessment all belong before the first training run, not after.
There is a question organisations rarely ask before training an AI model on their own data, and it is a different question from the one they ask about prompts. Typing personal data into a chatbot is a transient interaction the provider may or may not retain — and in an important sense it is recoverable: a retained prompt can be deleted, a contract can be renegotiated, a tool can be switched off. Training a model on personal data is a problem with no undo button.
An organisation I advise had reached the stage many reach in 2026. The free-tier exposure had been closed, an enterprise contract was in place, and the team had grown ambitious. The proposal on the table was to fine-tune an open-weights model — hosted entirely on the organisation’s own infrastructure, inside the EU, never touching a US provider — on several years of internal records: support tickets, HR case notes, performance reviews, customer correspondence. The argument was elegant and, on its face, hard to fault. “Everything stays in-house. No data leaves our servers. No American company can touch it. This is the privacy-preserving option.” The slide deck called it “GDPR by architecture.”
It was half right, and the half it got wrong is the half that matters more. Self-hosting does address a genuine and serious question — the jurisdictional one, the CLOUD Act and FISA 702 transfer exposure that comes with any US-controlled provider — though even that holds only to the extent that the infrastructure, support access, telemetry and any subprocessors are genuinely outside US reach. But it addresses that question and then quietly assumes it has answered all the others. It has not touched the legal-basis question — on what lawful ground are years of employee and customer personal data being processed for an entirely new purpose? And it has walked directly into a question that self-hosting actually makes worse, not better: when one of those employees or customers later exercises their right to erasure, can the organisation comply — or has it baked their personal data into a model’s weights, where it now lives in a form that is difficult, costly, and often unreliable to remove?
If that sounds abstract, consider what happened to the largest deployer of a generative model in the world when the question of deletion stopped being theoretical. In the copyright litigation brought by The New York Times, a federal magistrate judge ordered OpenAI on 13 May 2025 to preserve and segregate all ChatGPT output logs that would otherwise have been deleted, including chats that users had themselves deleted and content from “temporary” sessions that OpenAI’s own policy would normally purge within thirty days.1 OpenAI challenged the order, and its leadership called it a direct conflict with the company’s privacy commitments and with GDPR. The challenge was declined, and the preservation obligation remained in force until late September 2025; in November 2025 the court ordered roughly 20 million de-identified logs produced to the plaintiffs.2 The episode is a near-perfect inversion of the erasure problem this article is about (a court forcing data to be kept rather than a data subject asking for it to be removed), but it makes the same underlying point with unusual clarity: what an AI system can and cannot delete, and who actually controls that, is not a footnote. To be precise about what the episode shows: it concerns retained interaction logs, not trained parameters, so it is not evidence that the model’s weights contain those logs. What it does demonstrate is that AI data lifecycles are not fully governed by the user-facing “delete” button. If a court can override the deletion of logs, the harder question of deletion from the weights is not one an organisation should assume it controls either. The question is now being litigated in federal court, and the answers are frequently uncomfortable for the organisation that thought it was in control of its own data.
One detail of OpenAI’s response matters for the rest of this article. OpenAI was careful to state that ChatGPT Enterprise and its Zero Data Retention API were excluded from the preservation order, and that it does not train on business data by default.3 That is the enterprise/consumer distinction doing real work in a real courtroom: the contractual surface a controller chooses determines, concretely, whether its data is swept into a discovery order. But notice what even that protection does not touch — the model that has already been trained. A preservation order is about logs; erasure from the weights is a different and harder problem, and it is the one a self-hosting organisation takes onto its own books.
The question this article answers is the one the elegant slide deck never reached: not “where does the model run?” but “can you delete a person from it?” Everything that follows is an attempt to take that question seriously, because the GDPR does.
Training is not a use — it is an incorporation
The conceptual error underneath “GDPR by architecture” is treating training as if it were processing of the same kind as querying. It is not. When you send a prompt to a model, the personal data in the prompt is processed transiently for the duration of the request; on a properly contracted enterprise tier, it is then governed by retention rules and can, in principle, be deleted. When you train or fine-tune a model on personal data, something categorically different happens: the patterns in that data are encoded into the model’s parameters — billions of numerical weights — and the data ceases to exist as discrete, addressable records.4
This is not a metaphor. The European Data Protection Board, in Opinion 28/2024, declined to treat trained models as automatically anonymous precisely because of this: an AI model trained on personal data cannot in all cases be considered anonymous, and it qualifies as anonymous only where the controller can demonstrate, with evidence, that the likelihood of extracting personal data from the model — directly or probabilistically — and the likelihood of obtaining it through queries are both insignificant.5 A fine-tuned model that has memorised the specifics of named support tickets and named performance reviews does not clear that bar. It is, in the EDPB’s framing, a model that still contains personal data — just in a diffuse, distributed, hard-to-reach form.
The research literature is unambiguous about what that means for deletion. Large language models demonstrably memorise and can regurgitate personal data from their training corpus — names, addresses, contact details, clinical notes — verbatim or approximately.6 And removing a specific individual’s data once it is in the weights is, in the current state of the art, an unsolved problem dressed up in optimistic vocabulary. The available routes are: full retraining from scratch with the individual’s data excluded (definitive but expensive, and impractical at any cadence); machine unlearning techniques (active research, not yet reliable, and difficult to verify); or output filtering that suppresses generation of the erased person’s data without actually removing it from the model.7 Commentators have put the consequence bluntly: deleting personal data from the training set has no effect on a model that has already been trained, because there is, as yet, no dependable way to make a trained model forget.8
This is the asymmetry that “GDPR by architecture” misses entirely. Keeping the data in-house improves your control over where it is. It does nothing to improve your ability to get a specific person out of it once training has occurred. A self-hosted model is not a privacy-preserving model by virtue of being self-hosted; it is a model whose erasure problem now sits on your own balance sheet rather than a vendor’s.
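The memorisation risk described above can be probed empirically before anyone claims a model is anonymous. The following is a minimal sketch, not a validated audit method: `generate` stands in for whatever inference call the self-hosted model exposes (a hypothetical name), each record pairs a prompt prefix with the held-back personal text that originally followed it, and the 20-character threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def longest_shared_run(a: str, b: str) -> int:
    """Length of the longest contiguous substring that a and b share."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def probe_memorisation(
    generate: Callable[[str], str],        # hypothetical: prompt in, completion out
    records: Iterable[tuple[str, str]],    # (prompt prefix, held-back personal text)
    min_overlap: int = 20,                 # assumed threshold, tune per dataset
) -> list[str]:
    """Return the prefixes whose completion reproduces a long run of the held-back text."""
    leaked = []
    for prefix, secret in records:
        completion = generate(prefix)
        if longest_shared_run(completion, secret) >= min_overlap:
            leaked.append(prefix)
    return leaked
```

A hit is evidence of memorisation. An empty result is not evidence of anonymity: it only shows that these particular prompts did not elicit these particular strings.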
Legitimate interest is a test, not a checkbox
Set the erasure problem aside for a moment and ask the prior question: on what lawful basis is the organisation processing years of personal data for the new purpose of training a model? Under Article 6 GDPR, every processing operation needs a basis, and processing data collected for one purpose (running a support desk, administering employment) for a materially different purpose (training a model) is itself a processing operation requiring its own justification.
For most internal training scenarios, consent is the wrong tool. Employee consent is presumptively unfree because of the power imbalance in the employment relationship; retrospective consent from thousands of past customers is impractical to obtain and, where obtained under pressure, fragile. That pushes controllers toward Article 6(1)(f) — legitimate interest. The EDPB confirmed in Opinion 28/2024 that legitimate interest can be a valid basis for processing in both the development and deployment of AI models.9 But it confirmed this in the same breath as making clear that legitimate interest is explicitly not a default basis, and is acceptable only where the controller can pass the structured three-step test the EDPB built on its Guidelines 1/2024.10
The three steps are worth stating precisely, because in practice organisations perform the first and skip the other two.
Step one — the interest must be legitimate. It must be lawful, clearly and precisely articulated, and real and present rather than speculative.11 “We might find some efficiency gains” is speculative; “we will reduce average ticket-resolution time by routing on historical patterns” is articulable. Vagueness fails at step one.
Step two — necessity. The processing must be strictly necessary, meaning no less intrusive means would achieve the same interest. The EDPB sets a deliberately high bar here in relation to the volume of personal data involved.12 If the training objective can be met with a smaller, anonymised, or synthetic dataset, then processing the full corpus of named records is not necessary, and the basis fails — regardless of how legitimate the underlying interest is. This is where most internal training proposals quietly collapse: “train on everything we have” is rarely the least intrusive means.
Step three — balancing. The controller’s interest must not be overridden by the rights, freedoms, and reasonable expectations of the data subjects.13 An employee who submitted a grievance, or a customer who emailed a complaint, did not reasonably expect that text to become training material for a model years later. The further the new purpose sits from the original expectation, the heavier the balancing tilts against the controller — and where legitimate interest is relied upon, the data subject’s right to object under Article 21 must be available throughout.14
The EDPB’s own examples of interests that can survive this test are instructive in their narrowness: a conversational agent to assist users, an AI system to detect fraud, improving threat detection in an information system.15 These share a feature — a tight, demonstrable necessity link between the data and the purpose. “Fine-tune a general model on all our historical HR notes” has no such tight link, and a legitimate-interest assessment that pretends otherwise is an assessment written to reach a predetermined answer.
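The necessity step becomes concrete once it is treated as a preprocessing stage rather than a paragraph in an assessment. The sketch below shows one possible form: direct identifiers are replaced with keyed pseudonyms before any text reaches the training set. The function names, the email pattern, and the idea of a `known_names` list drawn from the organisation's own directory are all assumptions made for illustration. Pseudonymised data remains personal data under the GDPR, so this reduces the exposure rather than removing it.

```python
import hashlib
import hmac
import re

# Deliberately simple email pattern; real records need broader identifier coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonym(value: str, key: bytes, prefix: str) -> str:
    """Map an identifier to a stable keyed token; the key must be stored separately."""
    digest = hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"<{prefix}_{digest[:8]}>"


def minimise(text: str, known_names: list[str], key: bytes) -> str:
    """Replace email addresses and known names with pseudonymous tokens."""
    text = EMAIL_RE.sub(lambda m: pseudonym(m.group(0), key, "EMAIL"), text)
    # Longest names first, so a full name is replaced before any shorter name it contains.
    for name in sorted(known_names, key=len, reverse=True):
        token = pseudonym(name, key, "PERSON")
        text = re.sub(re.escape(name), lambda m, t=token: t, text, flags=re.IGNORECASE)
    return text
```

Name lists and regular expressions miss nicknames, misspellings, and indirect identifiers such as job titles or case details, so output of this kind still needs review before it is treated as minimised.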
What self-hosting genuinely fixes — and what it does not
None of this is an argument against self-hosting. Self-hosting is, for the right use case, the strongest available answer to the third-country transfer problem. A model running on infrastructure the organisation controls, inside the EU, on hardware no US-incorporated provider can be compelled to reach, removes the CLOUD Act and FISA 702 exposure at the root. That is a real and valuable property, and for sensitive processing it may be decisive.
The error is only in the substitution — in treating the jurisdictional fix as if it discharged the other obligations. Laid out plainly, the two axes are independent:
| Question | What self-hosting changes | What remains your problem |
|---|---|---|
| Can a US authority compel disclosure of the data? | Removes the exposure — provided there is genuinely no US-controlled entity or infrastructure anywhere in the chain | Nothing, if the chain is truly free of US-reachable parties |
| Is there a lawful basis for training on this data? | Nothing — the Article 6 / legitimate-interest test is unaffected by where the model runs | The full three-step assessment, exactly as before |
| Can we honour an erasure request after training? | Nothing — and the burden arguably grows, because the obligation now sits with you, not a vendor | A verifiable erasure pathway you must build yourself |
| Is a DPIA required? | Nothing — high-risk processing triggers Article 35 regardless of hosting | The DPIA, in full |
The right to erasure under Article 17 deserves a specific note here, because the self-hosting team had assumed it was their strength. The reasoning was: “we hold the data, so we can delete it.” That is true of the source records and false of the model. Deleting the original support ticket does not remove what the model learned from it. EU guidance does recognise a technical-impossibility and disproportionate-effort dimension under Article 17 — but it does not hand controllers a blanket excuse. The expectation is that an organisation must be able to demonstrate it explored reasonable technical alternatives (unlearning, retraining, robust output filtering with verification) before claiming that erasure from the model is infeasible.16 “We can’t, because that’s how neural networks work” is not, on its own, a compliant answer; it is the beginning of an assessment the controller is expected to have done in advance.
A stress test: the erasure request that arrives after training
Put the abstraction under load. Eighteen months after the fine-tune ships, a former employee — the one who filed a grievance that sat in the HR case notes — sends a written erasure request under Article 17. They want every trace of their personal data removed. The organisation deletes the source ticket from its database in minutes; that part is genuinely easy. Then the data protection officer asks the harder question: is this person’s data still in the model?
Here the comfortable answers run out one by one. The team cannot simply assert the model is anonymous: under Opinion 28/2024 that requires demonstrating, with evidence, that extraction and query-leakage likelihoods are both insignificant, and a fine-tune that memorised named HR notes will struggle to show it. They cannot quietly rely on an output filter that suppresses the name, because suppression is not removal and a regulator examining the matter will say so. They cannot credibly argue that erasure from the model is technically infeasible or disproportionate unless they can show they explored unlearning, retraining, and verification before this request arrived, and they did not, because the slide deck stopped at “GDPR by architecture.” The only fully defensible remedy left is to retrain the model from scratch without that individual’s data, at the cost and downtime that implies, every time such a request arrives.
This is the moment the difficulty stops being a theoretical property and becomes an operational and financial one. Notice, too, that self-hosting has made this worse, not better: there is no vendor to share the burden, no processor whose contract absorbs part of the obligation. The organisation that said “everything stays in-house” now owns the erasure problem in-house as well. The stress test is the proof of the thesis — the time to answer “can we delete a person from this model?” was before the data ever entered the weights, because afterward the remaining answers tend to be some combination of expensive, hard to verify, or legally exposed.
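The data protection officer's question in the stress test can only be answered if someone recorded, at training time, whose records went into which model. A minimal sketch of such a lineage ledger follows; the class and method names are hypothetical, and the subject IDs are assumed to be pseudonymous keys rather than names, since the ledger is itself a store of personal data.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingLedger:
    """Records which data subjects' records fed each model version."""

    # model version -> pseudonymous IDs of the subjects in its training set
    runs: dict[str, set[str]] = field(default_factory=dict)

    def record_run(self, model_version: str, subject_ids: set[str]) -> None:
        """Store a copy of the subject IDs used for one training run."""
        self.runs[model_version] = set(subject_ids)

    def affected_models(self, subject_id: str) -> list[str]:
        """List every model version trained on this subject's records."""
        return sorted(v for v, ids in self.runs.items() if subject_id in ids)


ledger = TrainingLedger()
ledger.record_run("support-ft-v1", {"subj-001", "subj-002"})
ledger.record_run("support-ft-v2", {"subj-002"})
print(ledger.affected_models("subj-002"))
```

The ledger does not remove anything from a model. It only turns the question of whether a person's data is in a given model from guesswork into a lookup, which is the precondition for scoping any retraining or unlearning.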
The order of operations is the whole discipline
Everything above reduces to a sequence, and the sequence is the point. The failure mode is never that an organisation reaches the wrong conclusion at the end — it is that it trains first and reasons afterward, at which point the difficulty of reversing what training has done has already foreclosed most of the options.
The correct order, before a single training run:
First, decide the purpose precisely. Not “improve operations” but the specific, articulable interest that will have to survive step one of the legitimate-interest test.
Second, minimise to that purpose. Establish the smallest, least-identifying dataset that meets the objective — anonymised where possible, synthetic where viable, pseudonymised at minimum, scoped to necessity. This is step two of the test done as engineering, not paperwork. It is also the single most effective intervention, because data that never enters the weights creates no erasure problem later.
Third, run the DPIA. Deploying a model trained on personal data at scale almost certainly meets the Article 35 high-risk threshold, and the DSK is explicit that a controller who is not also the system’s provider still owes its own risk assessment.17 The DPIA is where the balancing test, the erasure pathway, and the residual risks are written down — before, not after.
Fourth, design the erasure pathway in advance. Decide now, on paper, how a future Article 17 request will be met: which combination of source-record deletion, retraining cadence, unlearning, and output filtering applies, and how compliance will be verified. A model trained without an erasure plan is a model that will, sooner or later, receive a request it cannot honour.
Fifth — and only then — train. With the purpose fixed, the data minimised, the risk assessed, and the exit designed.
Self-hosting belongs to this discipline as one strong control among several. It is not a substitute for the discipline.
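The sequence above can be enforced mechanically, so that a training job refuses to start until the earlier steps are on record. This is a sketch under stated assumptions: the field names are hypothetical, and the booleans stand in for links to real artefacts (the documented assessment, the DPIA, the erasure plan). The gate checks that each step was recorded, not that it was done well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingReadiness:
    purpose: str                        # the specific, articulable interest
    dataset_minimised: bool             # the necessity step, done as engineering
    dpia_completed: bool                # Article 35 assessment on file
    erasure_pathway_documented: bool    # plan for future Article 17 requests

    def blockers(self) -> list[str]:
        """Return the unmet preconditions, in the order they should be resolved."""
        issues = []
        if not self.purpose.strip():
            issues.append("purpose not articulated")
        if not self.dataset_minimised:
            issues.append("dataset not minimised")
        if not self.dpia_completed:
            issues.append("DPIA not completed")
        if not self.erasure_pathway_documented:
            issues.append("erasure pathway not documented")
        return issues


def assert_ready_to_train(readiness: TrainingReadiness) -> None:
    """Raise before any training starts if a precondition is missing."""
    issues = readiness.blockers()
    if issues:
        raise RuntimeError("training blocked: " + "; ".join(issues))
```

Calling `assert_ready_to_train` as the first line of the training entry point puts the order of operations into the pipeline itself, where it cannot be skipped by starting the job early.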
The takeaway
SUMMARY FOR DECISION-MAKERS Self-hosting can reduce or remove the transfer exposure, depending on where infrastructure, support access, telemetry and subprocessors sit, but it does little else: the legal-basis question and the erasure question are untouched, and the erasure question is arguably made worse because the obligation is now yours. Training incorporates personal data into the weights in a way that is currently difficult, costly and often unreliable to reverse. Once the data is there, deletion on request cannot be reliably verified short of costly retraining, and unlearning remains of uncertain reliability. That is why the lawful basis (a three-step legitimate-interest test for which the EDPB sets a high bar), the data minimisation, the DPIA, and the erasure pathway all belong before the first training run. The right question is not “where does the model run?” but “can we delete a person from it?” If the answer is no, the model should not have been trained on them in the first place.
Training is not a use of personal data — it is an incorporation of it. A prompt can be deleted; a weight cannot be un-learned with anything like the same confidence. That single asymmetry is why processing data through a model and training a model on data are different problems, not two versions of one.
Self-hosting is a real answer to the transfer question and a non-answer to the lawful-basis and erasure questions. “Everything stays in-house” describes where the data is; it says nothing about whether you were permitted to train on it, or whether you can comply when someone asks to be removed.
The discipline is the order of operations: purpose, minimisation, DPIA, erasure plan — then training. Reverse that order and the difficulty of undoing what training has done has already made the decision for you.
Glossary of terms and abbreviations
| Term | Definition |
|---|---|
| Fine-tuning | Further training of a pre-trained model on a narrower dataset to specialise it |
| Weights / parameters | The billions of numerical values into which a model encodes patterns from training data |
| Legitimate interest | Lawful basis under Art. 6(1)(f) GDPR, requiring a three-step assessment |
| LIA | Legitimate Interest Assessment — the documented three-step test |
| Machine unlearning | Research techniques aiming to remove specific data from a trained model without full retraining |
| Right to erasure | The data subject’s right under Art. 17 GDPR to have personal data deleted |
| DPIA | Data Protection Impact Assessment (Art. 35 GDPR), required for high-risk processing |
| Self-hosted | A model run on infrastructure the organisation controls, rather than a vendor’s |
| EDPB Opinion 28/2024 | The EDPB’s December 2024 opinion on personal data in AI models |
| DSK | Datenschutzkonferenz — Germany’s assembly of data protection authorities |
Legal notice: This article serves general information purposes and does not constitute legal advice. For a legally sound assessment in a specific case, consultation with a specialised data protection lawyer is recommended. As of: June 2026.
1. In the consolidated copyright litigation led by The New York Times (S.D.N.Y.), Magistrate Judge Ona T. Wang ordered OpenAI on 13 May 2025 to preserve and segregate all ChatGPT output log data that would otherwise be deleted, including user-deleted and “temporary” chats normally purged within ~30 days. See OpenAI, “How we’re responding to The New York Times’ data demands” https://openai.com/index/response-to-nyt-data-demands/ and coverage at https://www.thurrott.com/a-i/openai-a-i/330404/openai-must-turn-over-chatgpt-logs-in-new-york-times-case (accessed June 2026).
2. OpenAI opposed the order and lost; the indefinite-retention obligation ran until it was lifted effective 26 September 2025, and in November 2025 the court ordered production of ~20 million de-identified ChatGPT logs (Dec 2022–Nov 2024) to the plaintiffs. See Bloomberg Law, “OpenAI Must Turn Over 20 Million ChatGPT Logs, Judge Affirms,” 12 Nov 2025 https://news.bloomberglaw.com/ip-law/openai-must-turn-over-20-million-chatgpt-logs-judge-affirms (accessed June 2026).
3. OpenAI, “How we’re responding to The New York Times’ data demands”: ChatGPT Enterprise was clarified as excluded from the preservation order; Zero Data Retention API content is not retained and is unaffected; and “we don’t train our models on business data by default.” Available at: https://openai.com/index/response-to-nyt-data-demands/ (accessed June 2026).
4. Duality Technologies, “LLMs and Data Privacy: How to Protect Sensitive Information”, 2026: during training an LLM encodes patterns from its corpus into billions of parameters; for a model whose weights encoded an individual’s data, Article 17 deletion is technically challenging. Available at: https://dualitytech.com/blog/llm-data-privacy/ (accessed June 2026).
5. European Data Protection Board, “Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models”, adopted 17 December 2024: a model is anonymous only where the likelihood of extracting personal data (directly or probabilistically) and of obtaining it via queries are both insignificant, assessed case by case. Available at: https://www.edpb.europa.eu/system/files/2024-12/edpb_opinion_202428_ai-models_en.pdf (accessed June 2026).
6. “What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests”, arXiv 2507.11128 (XKDD 2025 / ECML PKDD 2025): LLMs memorise and can leak PII verbatim or approximately; post-hoc filters and RAG do not remove memorised content from the model itself. Available at: https://arxiv.org/pdf/2507.11128 (accessed June 2026).
7. GDPR Local, “Large Language Models (LLM) GDPR Compliance”, 9 December 2025: erasure options are machine unlearning, full retraining (may degrade accuracy), or output filtering; Art. 17(3) technical-impossibility exceptions require demonstrating reasonable alternatives were explored. Available at: https://gdprlocal.com/large-language-models-llm-gdpr/ (accessed June 2026).
8. IAPP, “Perspective: Why data subjects’ rights to LLM training data are not relevant”, 17 February 2026: deleting personal data from training data has no impact on an already-trained LLM; “machine unlearning” of an already-trained model is, in the current state of the art, not yet possible. Available at: https://iapp.org/news/a/perspective-why-data-subjects-rights-to-llm-training-data-are-not-relevant (accessed June 2026). See also TechPolicy.Press, “The Right to Be Forgotten Is Dead”, 20 May 2025.
9. European Data Protection Board, news release “EDPB opinion on AI models: GDPR principles support responsible AI”, 18 December 2024: legitimate interest can be a legal basis for development and deployment, but only where processing is shown to be strictly necessary and the balancing of rights is respected. Available at: https://www.edpb.europa.eu/news/news/2024/edpb-opinion-ai-models-gdpr-principles-support-responsible-ai_en (accessed June 2026).
10. CMS, “EDPB Opinion 28/2024: key takeaways”, 20 March 2026: the EDPB asserts that Art. 6(1)(f) cannot be the “by default” legal basis for training and use of AI models; it is acceptable only on a demonstrated three-step legitimate-interest assessment built on Guidelines 1/2024. Available at: https://cms.law/en/deu/legal-updates/edpb-opinion-28-2024-key-takeaways-on-processing-personal-data-in-the-context-of-ai-models (accessed June 2026).
11. EDPB Opinion 28/2024, on the first step: an interest is legitimate where it is (1) lawful, (2) clearly and precisely articulated, and (3) real and present (not speculative). See also Debevoise Data Blog, 14 April 2025. Available at: https://www.debevoisedatablog.com/2025/04/14/gdpr-considerations-when-developing-and-deploying-ai-models-the-edpbs-opinion-on-compliance/ (accessed June 2026).
12. IAPP, “EDPB weighs in on key questions on personal data in AI models”, 17 February 2026: the EDPB sets a high bar for necessity in relation to the volume of personal data involved in the model. Available at: https://iapp.org/news/a/edpb-weighs-in-on-key-questions-on-personal-data-in-ai-models (accessed June 2026).
13. European Papers, “Processing Personal Data in the Context of AI Models: EDPB’s Opinion 28/2024”, 27 February 2025: the third step balances the legitimate interest against data subjects’ fundamental rights, the impact of the processing, and their reasonable expectations. Available at: https://www.europeanpapers.eu/europeanforum/protecting-personal-data-in-context-of-ai-models (accessed June 2026).
14. EDPB Opinion 28/2024: wherever legitimate interest is relied upon, the Article 21 right to object applies and must be ensured.
15. EDPB news release, 18 December 2024 (op. cit.): examples of interests capable of relying on legitimate interest are a conversational agent to assist users; an AI system to detect fraudulent content or behaviour; improving threat detection in an information system.
16. GDPR Local (op. cit.): EU guidance recognises technical-impossibility exceptions under Article 17(3), but organisations must demonstrate they explored reasonable technical alternatives before claiming such exceptions for AI systems.
17. Datenschutzkonferenz (DSK), “Orientierungshilfe Künstliche Intelligenz und Datenschutz”: a DPIA is frequently required for AI processing, and where the controller is not also the provider of the AI system, the controller remains obliged to carry out its own risk assessment. Available at: https://www.datenschutzkonferenz-online.de/orientierungshilfen.html (accessed June 2026).