
GitHub Just Made Your Developers' Code Training Data. Did Anyone Tell Your Audit Committee?

Subscribe now to join the Risk Register community:

On March 25, 2026, GitHub announced what we'll generously call a "policy update." Starting April 24, interaction data from Copilot Free, Pro, and Pro+ users will be used to train GitHub's AI models (unless users actively opt out). 

And by "interaction data," we mean pretty much everything: the code you write, the prompts you send, the file structures you navigate, even the cursor context as you hunt through your repository at 11 PM trying to figure out why production is down.

The default setting? Enabled. Do nothing, and your code becomes training material. It's the software equivalent of an organ donor card you never remember signing.

Now, before we spiral into panic mode, let's be clear: not everyone is affected.

If your entire organization runs on Copilot Business or Enterprise, you're covered by contractual data protection agreements that explicitly prohibit this kind of data use. Students and teachers with free academic access are also excluded. So if you've got your licensing house in order and everyone's on an enterprise tier, congratulations. You can skip to the end and enjoy your well-governed afternoon.

But here's where things get interesting. 

The affected tiers (Copilot Free, Pro, and Pro+) are exactly the ones individual developers sign up for on their own. You know, the ones your procurement team has no visibility into. The ones that don't show up in your SaaS inventory. The ones purchased with personal credit cards because the enterprise provisioning process takes six weeks and involves three approvals.

So the real question isn't "does our organization use Copilot?"

It's "do we actually know which Copilot tier every developer touching our code is using?" And if you just felt a small knot form in your stomach, welcome to the party. You're in good company.

Because here's the thing: this policy change isn't creating a new risk. It's just turning on the floodlights and showing us a governance gap that's been sitting there all along, quietly growing while we were busy auditing last year's controls with last decade's frameworks.

What's Actually Changing (And Why It Matters)

Let's break down exactly what GitHub is collecting.

Starting April 24, if you're on Copilot Free, Pro, or Pro+, the following goes into the training pipeline:

  • Code outputs a developer accepts or modifies
  • Inputs sent to Copilot, including code snippets visible to the model
  • Code context surrounding the cursor position
  • Comments and documentation written during use
  • File names, repository structure, and navigation patterns
  • Interactions with Copilot features (chat, inline suggestions)
  • Feedback on suggestions (thumbs up/down ratings)

All of this gets shared with Microsoft and other GitHub affiliates.

The good news? It won't be shared with third-party AI model providers. 

The bad news? If you think Microsoft and its corporate family tree is a narrow distribution list, we should talk.

Here's the technical detail that matters: GitHub says it doesn't use private repository content "at rest" for training. That sounds reassuring until you realize that Copilot processes code from private repositories during active use, and that interaction data is absolutely fair game for training unless you opt out.

In other words, the protection isn't about whether your repo is public or private. It's about whether your code is sitting on a server somewhere or actively flowing through an AI model while you're working. That's a meaningful distinction, and it's one most organizations haven't thought through yet.

The Shadow IT Problem Nobody Wants to Talk About

Let's talk about the elephant in the data center. According to EY's 2026 Technology Pulse Poll, 52% of department-level AI initiatives are running without formal approval, and nearly 47% of generative AI users are accessing tools through personal accounts (not enterprise ones).

Translation: odds are that close to half your development team is using AI coding assistants on personal accounts right now. You don't know which tier they're on. You don't know if they've opted out. You probably don't even know they're using it, because the provisioning request is still sitting in someone's Jira backlog marked "medium priority."

And GitHub just flipped the default switch to "yes, please train on all of this."

Most organizations have spent years (not to mention budgets) building controls around data loss prevention. You've got DLP policies, source code management protocols, intellectual property protection frameworks. All of those controls assume data leaves your organization through identifiable channels: email attachments, cloud storage uploads, USB drives tucked into laptop bags, unauthorized SaaS applications that someone in marketing installed.

AI coding assistants don't fit that model. They operate in real time, ingesting code context and business logic as developers work, generating suggestions that get accepted or rejected, learning from every interaction. It's data exfiltration, except it doesn't look like exfiltration. It looks like productivity.

And under GitHub's new policy, all of that interaction data becomes training material by default. Unless someone opts out. Which, statistically speaking, almost nobody will, because most developers have no idea this policy change is even happening.

How Protection Became a Subscription Feature

Here's the governance puzzle that should be keeping someone awake at night: GitHub's data protection doesn't follow the code. It follows the license.

Picture this: two developers working in the same private repository. One's on a Business license. Their interaction data is contractually protected. The other's on a personal Pro account they signed up for themselves. Their interaction data is fair game for training. Same repository. Same code. Different privacy outcomes based entirely on who's paying the subscription fee.

It's like having a conversation where half the participants are covered by attorney-client privilege and half aren't, except nobody's quite sure who's in which group and the stakes involve your organization's proprietary code.

Most organizations have zero visibility into this.

The governance gaps are predictable and probably familiar:

  • Developers use personal GitHub accounts alongside enterprise accounts
  • Shadow IT adoption of AI tools has outpaced policy
  • Copilot licenses are procured at the team or department level without centralized oversight
  • Contractors or freelancers with personal accounts contribute to company repositories

One developer on a personal account. One private repository. One policy change they don't know about. That's all it takes to create a data leak that doesn't look like a data leak: just someone doing their job with a productivity tool.

This is not a hypothetical risk. As one developer quoted in InfoQ's coverage pointed out, individual users within an organization typically do not have the authority to license their employer's source code to third parties. A single team member who does not opt out could expose proprietary code through their Copilot interactions.

So Where Does This Leave Internal Audit?

Let's ask the question that's probably been floating around since paragraph three: if this risk was sitting there all along, why are we only talking about it now?

Did your internal audit function have a process in place to monitor AI tool vendor policy changes?

Do you have a way to track which developers are using personal accounts versus enterprise accounts?

Can you flag when a major vendor announces a shift in data usage terms?

If the answer is "not really" or "we're working on it," you're in good company. 

Most organizations are in the same boat, rowing in circles, trying to figure out how to govern tools that didn't exist when they wrote the governance framework.

Here's the uncomfortable truth: GitHub isn't unique. Every major AI tool vendor is moving toward this model: using interaction data for training, with varying opt-out mechanisms, varying transparency, and varying contractual protections for enterprise customers. This isn't a one-time event. It's the new operating environment.

If AI infrastructure demand can cause non-linear disruptions in memory pricing (hello, RAM shortage), it can absolutely create sudden policy shifts in data governance. Most audit frameworks weren't built to detect that kind of fast-moving risk, and this GitHub policy change is a perfect test case to find out where the gaps are.

What Good Governance Actually Looks Like Here

If you're assessing your organization's exposure, here are the five areas that matter:

AI Tool Inventory. You need to know what's actually in use, not just what's approved. Catalog every AI coding assistant, code completion tool, and AI-powered IDE extension touching your code. Include sanctioned tools and shadow IT. Map each tool's data handling terms by license tier. If you don't have this inventory, you can't scope the risk. Full stop.
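The inventory step can be partially automated. Here's a minimal sketch in Python, assuming you've collected each developer's extension list with VS Code's `code --list-extensions` command; the watchlist of extension IDs below is illustrative, not exhaustive:

```python
# Flag known AI coding-assistant extensions in a developer's VS Code
# extension list (as produced by `code --list-extensions`).
# Watchlist is illustrative -- extend it for your own environment.
AI_ASSISTANT_EXTENSIONS = {
    "github.copilot",
    "github.copilot-chat",
    "codeium.codeium",
    "tabnine.tabnine-vscode",
}

def flag_ai_extensions(installed: list[str]) -> list[str]:
    """Return installed extensions that match the AI-assistant watchlist."""
    return sorted(e for e in installed if e.lower() in AI_ASSISTANT_EXTENSIONS)

# Example: one developer's extension dump
installed = ["ms-python.python", "GitHub.copilot", "esbenp.prettier-vscode"]
print(flag_ai_extensions(installed))  # ['GitHub.copilot'] -- one hit to investigate
```

Run against every developer workstation (or pull the same data from your endpoint management tool), and you have the beginnings of the inventory.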

Account Governance. Determine whether developers are using personal accounts, enterprise accounts, or (most likely) both. Identify where personal-tier accounts interact with company code. Then require enterprise-tier licensing for any AI tool touching proprietary repositories. This needs to be a policy control, not a suggestion buried in an email nobody reads.
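One way to get traction here: GitHub's REST API exposes org-managed Copilot seat assignments (the `GET /orgs/{org}/copilot/billing/seats` endpoint), so you can cross-reference seat holders against repository contributors. A minimal sketch, assuming both lists have already been pulled:

```python
# Cross-reference repository contributors against org-managed Copilot
# seats. Contributors without an org seat are likely on personal tiers
# (Free, Pro, Pro+) -- i.e., the tiers affected by the policy change.
# Assumes the two lists were pulled via GitHub's REST API:
#   GET /orgs/{org}/copilot/billing/seats
#   GET /repos/{owner}/{repo}/contributors
def uncovered_contributors(contributors: list[str], org_seats: set[str]) -> list[str]:
    """Contributors with no org-managed Copilot seat, sorted for reporting."""
    return sorted(set(contributors) - org_seats)

contributors = ["alice", "bob", "contractor-dev"]
org_seats = {"alice", "bob"}  # seat assignments from the billing endpoint
print(uncovered_contributors(contributors, org_seats))  # ['contractor-dev']
```

Anyone on the resulting list is contributing to company code without the contractual protections that follow an enterprise seat. That's your follow-up list.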

Configuration Management. For tools with opt-out settings, verify those settings are enabled and enforced at the organizational level. Don't rely on individual developers to configure privacy settings correctly. Treat AI tool configuration as a technical control, not a user preference.
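As far as we know, GitHub doesn't expose the personal training toggle through a public API, so verifying personal-tier accounts will likely lean on attestations or screenshot review. Here's a sketch of what the control check might look like once you've gathered that evidence; the record shape (`account`, `tier`, `training_enabled`) is hypothetical:

```python
# Treat the opt-out as a verifiable control: every account on an
# affected tier must have model training disabled. The record shape
# below is hypothetical -- the data would come from attestations or
# screenshot review, since GitHub does not (to our knowledge) expose
# the personal training toggle via a public API.
AFFECTED_TIERS = {"free", "pro", "pro+"}

def audit_opt_outs(accounts: list[dict]) -> list[str]:
    """Return accounts on affected tiers that still allow model training."""
    return [a["account"] for a in accounts
            if a["tier"] in AFFECTED_TIERS and a["training_enabled"]]

accounts = [
    {"account": "alice", "tier": "business", "training_enabled": False},
    {"account": "bob", "tier": "pro", "training_enabled": True},  # finding
]
print(audit_opt_outs(accounts))  # ['bob']
```

Anything this check returns is an audit finding, not a reminder email.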

Acceptable Use Policies. Update your data handling policies to explicitly address AI tool interaction data. Define what qualifies as "company data" in the context of AI-assisted development: prompts, suggestions, file context, repository metadata, all of it. Make it explicit that interaction data is company data, subject to the same protections.

Third-Party Risk Assessment. Evaluate AI tool vendors through your existing TPRM framework, with AI-specific additions. Assess data sharing terms, including affiliate sharing (like GitHub's arrangement with Microsoft). Add AI-specific data flow questions to vendor assessments. If your TPRM process doesn't cover AI vendors yet, it's time to update it.

Why This Matters Beyond GitHub

Let's zoom out for a second, because this isn't really about Copilot.

GitHub's policy change is one data point in a larger pattern. AI governance is moving from a back-office compliance task to a board-level strategic concern. The rise of AI agents and autonomous systems is forcing organizations to rethink how they manage risk, accountability, and trust at scale. 

And internal audit is uniquely positioned to assess whether governance frameworks are keeping pace.

The question isn't whether your organization uses AI tools (it obviously does). The question is whether anyone has actually mapped the data flows, evaluated the contractual terms, and verified that controls match the risk profile.

If your governance framework was built before AI coding assistants existed (and most were), it almost certainly has gaps. This policy change is a forcing function to find them, which is useful, even if the timing is inconvenient.

But here's what we should really be asking: why did it take a vendor announcement to surface this risk? A proactive audit function should have been tracking AI tool adoption, monitoring vendor policy changes, and assessing the gap between enterprise and personal-tier protections months ago. If that didn't happen, it's worth understanding why: not to assign blame, but to build the capability going forward.

Three Things to Do This Week

1. Scope the exposure. Figure out which Copilot licensing tiers are actually in use across your organization. If everyone's on Business or Enterprise, document that and confirm the data protection terms apply. If anyone's using Free, Pro, or Pro+ accounts with company code (including contractors, freelancers, or that one developer who got tired of waiting for procurement), that's your exposure. You need to know the scope before you can manage the risk.

2. Check the opt-out setting. For any account on an affected tier touching proprietary code, make sure the "Allow GitHub to use my data for AI model training" setting is disabled. Don't wait until April 24. Don't assume developers will handle it themselves. Make it a configuration control and verify it's set correctly.

3. Add AI tools to your audit scope. If AI coding tools aren't in your audit universe yet, add them. If they are, verify that your risk assessment reflects current vendor policies, not what was true two years ago when someone copied the template. Map the data flows. Find the gaps. Report them. This is exactly the kind of emerging risk internal audit should be surfacing.

The Bigger Question for Audit Functions

Here's what this really comes down to: the value of internal audit isn't just catching what went wrong after the fact. It's providing foresight. It's helping the organization connect macro-level signals to operational exposure early enough that leadership can actually do something about it.

Six months ago, a proactive conversation with procurement, IT, and finance about AI tool governance could have meant locking in enterprise-tier licenses before this policy change hit, accelerating governance policy development, or at minimum building realistic risk assumptions into vendor assessments. If that conversation didn't happen, it's worth asking why: not to point fingers, but to genuinely assess whether your risk monitoring processes are set up to catch these signals.

Are your processes picking up external signals like vendor policy changes?

Are you connected to the right data sources to spot them?

Are your findings reaching decision-makers with enough lead time to matter?

If the honest answer is "not consistently," you're not alone. Most GRC frameworks simply haven't caught up to the pace of AI-driven change yet. 

But that gap between early signal and operational impact is shrinking fast, and closing it through better risk monitoring, tighter feedback loops between audit and IT, and a culture of early risk communication is exactly how audit functions evolve from compliance checkboxes into strategic advisors.

The next disruption is already on its way. The question is whether your function will see it in time to do something about it.

---

For questions about AI governance, IT audit strategy, or internal audit program design, contact Cherry Hill Advisory at cherryhilladvisory.com.
