<https://www.washingtonpost.com/technology/2023/09/08/gmail-instagram-facebook-trains-ai/>
It’s your Gmail. It’s also Google’s artificial intelligence factory.
Unless you turn it off, Google uses your Gmail to train an AI to finish other people’s sentences. It does that by analyzing how you respond to its suggestions. And when you opt in to a new Gmail function called Help Me Write, Google uses what you type into it to improve its AI writing, too. You can’t use the feature without agreeing to that.
Your email is just the start. Meta, the owner of Facebook, took a billion Instagram posts from public accounts to train an AI, and didn’t ask permission. Microsoft uses your chats with Bing to coach the AI bot to better answer questions, and you can’t stop it.
Increasingly, tech companies are taking your conversations, photos and documents to teach their AI how to write, paint and pretend to be human. You might be accustomed to them selling your data or using it to target you with ads. But now they’re using it to create lucrative new technologies that could upend the economy — and make Big Tech even bigger.
We don’t yet understand the risk that this behavior poses to your privacy, reputation or work. But there’s not much you can do about it.
Sometimes the companies handle your data with care. Other times, their behavior is out of sync with common expectations for what happens with your information, including stuff you thought was supposed to be private.
How your data trains Big Tech’s AI
Meta says it can use the contents of photos and videos shared to “public” on its social networks to train its AI products. You can make your Instagram account private or change the audience for your Facebook posts.
Gmail, by default in the U.S., uses how you respond to its Smart Compose suggestions to train the AI to better finish people’s sentences. You can opt out.
Microsoft uses your conversations with its Bing chatbot to “fine-tune” the AI and shares them with its partner OpenAI. There is no way to opt out as a consumer.
Google learns from your conversations with its Bard chatbot, including having some reviewed by humans. You can ask Google to delete your chat history, but it will still hold on to chats for up to 72 hours.
Google uses what you type and other “interactions” with its Workspace Labs AI in Gmail, Docs, Slides and Sheets to help its AI become a better creative coach. You cannot opt out if you want to use these functions.
Google uses your private text or voice conversations with its Assistant to “fine-tune” the responses of Assistant or Bard. You can opt out by adjusting your Google privacy settings to not save your activity.
Google says it can use “publicly available information” to train its AI, including the contents of YouTube videos and Google Docs that have been published to the Web.
Zoom set off alarms last month by claiming it could use the private contents of video chats to improve its AI products, before reversing course. Earlier this summer, Google updated its privacy policy to say it can use any “publicly available information” to train AI. (Google didn’t say why it thinks it has that right. But it says that’s not a new policy and it just wanted to be clear it applies to its Bard chatbot.)
If you’re using pretty much any of Big Tech’s buzzy new generative AI products, you’ve likely been compelled to agree to help make their AI smarter, sometimes including having humans review what you do with them.
Lost in the data grab: Most people have no way to make truly informed decisions about how their data is being used to train AI. That can feel like a privacy violation — or just like theft.
“AI represents a once-in-a-generation leap forward,” says Nicholas Piachaud, a director at the open source nonprofit Mozilla Foundation. “This is an appropriate moment to step back and think: What’s at stake here? Are we willing just to give away our right to privacy, our personal data to these big companies? Or should privacy be the default?”
New privacy risks
It isn’t new for tech companies to use your data to train AI products. Netflix uses what you watch and rate to generate recommendations. Meta uses what you like, comment on and even spend time looking at to train its AI how to order your news feed and show you ads.
Yet generative AI is different. Today’s AI arms race needs lots and lots of data. Elon Musk, chief executive of Tesla, recently bragged to his biographer that he had access to 160 billion video frames per day shot from the cameras built into people’s cars to fuel his AI ambitions.
“Everybody is sort of acting as if there is this manifest destiny of technological tools built with people’s data,” says Ben Winters, a senior counsel at the Electronic Privacy Information Center (EPIC), who has been studying the harms of generative AI. “With the increasing use of AI tools comes this skewed incentive to collect as much data as you can upfront.”
All of this brings some unique privacy risks. Training an AI to learn everything about the world means it also ends up learning intimate things about individuals, from financial and medical details to people’s photos and writing.
Some tech companies even acknowledge that in their fine print. When you sign up to use Google’s new Workspace Labs AI writing and image-generation helpers for Gmail, Docs, Sheets and Slides, the company warns: “don’t include personal, confidential, or sensitive information.”
The actual process of training AI can be a bit creepy. Companies employ humans to review some of how we use products such as Google’s new AI-fueled search, called SGE. In its fine print for Workspace Labs, Google warns it may retain data seen by human reviewers for up to four years, stored in a manner not directly associated with your account.
Even worse for your privacy, AI sometimes leaks data back out. Generative AI, which is notoriously hard to control, can regurgitate personal information in response to a new, sometimes unforeseen prompt.
It even happened to a tech company. Samsung employees were reportedly using ChatGPT and discovered on three different occasions that the chatbot spit company secrets back out. The company then banned the use of AI chatbots at work. Apple, Spotify, Verizon and many banks have done the same.
The Big Tech companies told me they take pains to prevent leaks. Microsoft says it de-identifies user data entered in Bing chat. Google says it automatically removes personally identifiable information from training data. Meta says it will train generative AI not to reveal private information — so it might share the birthday of a celebrity, but not regular people.
Okay, but how effective are these measures? That’s among the questions the companies don’t give straight answers to. “While our filters are at the cutting edge in the industry, we’re continuing to improve them,” says Google. And how often do they leak? “We believe it’s very limited,” it says.
It’s great to know Google’s AI only sometimes leaks our information. “It’s really difficult for them to say, with a straight face, ‘we don’t have any sensitive data,’” says Winters of EPIC.
Perhaps privacy isn’t even the right word for this mess. It’s also about control. Who’d ever have imagined a vacation photo they posted in 2009 would be used by a megacorporation in 2023 to teach an AI to make art, put a photographer out of a job, or identify someone’s face to police? When they take your information to train AI, companies can ignore your original intent in creating or sharing it in the first place.
There’s a thin line between “making products better” and theft, and tech companies think they get to draw it.
Your data, their rules
Which data of ours is and isn’t off limits? Much of the answer is wrapped up in lawsuits, investigations and hopefully some new laws. But meanwhile, Big Tech is making up its own rules.
I asked Google, Meta and Microsoft to tell me exactly when they take user data from products that are core to modern life to make their new generative AI products smarter. Getting straight answers was like chasing a squirrel through a funhouse.
They told me they hadn’t used nonpublic user information in their largest AI models without permission. But those very carefully chosen words leave room for plenty of occasions when they are, in fact, building lucrative AI businesses with our digital lives.
Not all AI uses for data are the same, or even problematic. But as users, we practically need a degree in computer science to understand what’s going on.
Google is a great example. It tells me its “foundational” AI models — the software behind things like Bard, its answer-anything chatbot — come primarily from “publicly available data from the internet.” Our private Gmail didn’t contribute to that, the company says.
However, Google does still use Gmail to train other AI products, like Smart Compose (which finishes sentences for you) and the new creative coach Help Me Write that’s part of its Workspace Labs. Those uses are fundamentally different from “foundational” AI, Google says, because it’s using data from a product to improve that product. The Smart Compose AI, it says, anonymizes and aggregates our information and improves the AI “without exposing the actual content in question.” It says the Help Me Write AI learns from your “interactions, user-initiated feedback, and usage metrics.” How are you supposed to know what’s actually going on?
Perhaps there’s no way to create something like Smart Compose without data about how you use your email. But that doesn’t mean Google should just switch it on by default. In Europe, where there are stricter data laws, Smart Compose is off by default. Nor should access to your data be a requirement to use its latest and greatest products, even if Google calls them “experiments.”
Meta told me it didn’t train its biggest generative AI model, called Llama 2, on user data — public or private. However, it has trained other AI, like an image-identification system called SEER, on people’s public Instagram accounts. To avoid that, you’d have to have set your account to private, or quit Instagram.
And Meta wouldn’t answer my questions about how it’s using our personal data to train generative AI products it is expected to unveil soon. After I pushed back, the company said it would “not train our generative AI models on people’s messages with their friends and families.” At least it agreed to draw some kind of red line.
Microsoft updated its service agreement this summer with broad language about user data, and it didn’t make any assurances to me about limiting the use of our data to train its AI products. Microsoft tells me it does not use our data from Word or other Microsoft 365 programs to “train underlying foundational models,” but that’s not the question I was asking.
The consumer advocates at Mozilla also launched a campaign calling on Microsoft to come clean. “If nine experts in privacy can’t understand what Microsoft does with your data, what chance does the average person have?” Mozilla says.
It doesn’t have to be this way. Microsoft has lots of assurances for lucrative corporate customers, including those chatting with the enterprise version of Bing, about keeping their data private. “Data always remains within the customer’s tenant and is never used for other purposes,” says a spokesman.
Why do companies have more of a right to privacy than all of us?