The perils of using someone else’s data

Another day. Another lawsuit.

So the RIAA has just sued two music generation companies – Suno and Udio. And this is hardly the first such lawsuit. In the recent past, we’ve seen lawsuits against OpenAI, against GitHub, and others, and it may well be that the lawsuit avalanche is just getting started.

Whatever else Generative AI may be, it is a magnet for lawsuits.

Data and the resulting AI model

All these lawsuits have one thing in common. The essence of the allegation is that these companies used someone else’s data to train their AI models and are making money off those models, money the plaintiffs argue they don’t deserve. We’ll leave it to lawyers, judges, and lawmakers to draw the boundaries between a subtle copy that deserves penalties, a synthesis resulting in something original and useful, and something not AI-generated at all. We expect it to be difficult, as illustrated by the case of Miles Astray, who submitted a real photo (“Flamingone”) to the 1839 Awards contest under the AI-generated images category and won!

Bias

There is another serious issue with training on someone else’s data: bias. And zeal in trying to correct for that bias can produce a different kind of bias, too! Remember Google’s AI embarrassingly depicting America’s Founding Fathers? But bias may not just be a gaffe. It can also mean misreading a situation, with accidents as a possible consequence.

And even if the model works today, there is no guarantee it will work a year from now, because the character of the data itself may change, and a model trained on the earlier data may need retraining. This problem is known as model drift. An easy way to understand it is to compare text from 1924 with text from today: what sounds polite to us now might have been fighting words then.
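For the technically inclined, a common way to watch for drift is to compare how an input feature was distributed at training time with how it is distributed in recent data. The sketch below is one minimal way to do that; it assumes numeric features and SciPy, the data is made up, and the 0.05 threshold is illustrative rather than a recommendation.

    import numpy as np
    from scipy.stats import ks_2samp

    def feature_drifted(train_values, recent_values, alpha=0.05):
        # Two-sample Kolmogorov-Smirnov test: a small p-value suggests
        # the recent data no longer looks like the training data.
        statistic, p_value = ks_2samp(train_values, recent_values)
        return p_value < alpha, statistic

    # Hypothetical example: a feature whose distribution has shifted.
    rng = np.random.default_rng(0)
    train = rng.normal(loc=0.0, scale=1.0, size=5000)
    recent = rng.normal(loc=0.5, scale=1.2, size=5000)

    drifted, stat = feature_drifted(train, recent)
    print(f"drift detected: {drifted} (KS statistic = {stat:.3f})")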

Generative and Discriminative AI

Generative AI gets a free pass from most of us, most of the time. That is because we don’t expect Gen-AI to produce “falsifiable propositions,” to borrow Karl Popper’s term. When we ask Gen-AI to write a story or produce an image, we usually can’t – and don’t care to – look too closely at what it produces: we have no hard reference to compare it against and pronounce it right or wrong.

But discriminative AI is different. For example, whether a customer’s email indicates an irate customer or a happy one is usually not a matter of opinion. Nor is whether inspected blueberries are grade A, B, or C by the receiving organization’s standards. Nor how worn a part is for the use for which it is being evaluated.

In such cases, using models trained on someone else’s data may be like using someone else’s prescription eyewear.

To be fair, this also applies to Gen-AI, but the effects are less likely to cause grief, don’t jump out and bite us right away, and, at least for text generation, are mitigated by the retrieval-augmented generation (RAG) that is often part of the Gen-AI solution.
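To make the RAG point concrete, here is a minimal sketch of the retrieval half: answers get grounded in your own documents rather than in whatever the model’s training data happened to contain. It assumes scikit-learn for TF-IDF retrieval, the documents are hypothetical, and build_prompt() stands in for the call to whatever LLM you use; real systems typically use embedding models and vector stores instead.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical in-house knowledge base: your own data, not someone else's.
    documents = [
        "Refunds are allowed within 30 days of purchase.",
        "Grade A blueberries must be firm, uniform in color, and unbruised.",
        "Replace the bearing when wear exceeds 0.4 mm in high-speed service.",
    ]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)

    def retrieve(query, k=2):
        # Rank documents by cosine similarity to the query; keep the top k.
        scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
        return [documents[i] for i in scores.argsort()[::-1][:k]]

    def build_prompt(query):
        # Stuff the retrieved context into the prompt sent to the LLM,
        # so its answer is grounded in your documents.
        context = "\n".join(retrieve(query))
        return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

    print(build_prompt("What is the refund window?"))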

The takeaway?

Use your own data wherever, whenever, and as much as possible.

  • It may be the only way to get an AI model that works for you, and specifically for you.
  • It is your data, so you don’t owe anyone anything.
  • If it is operations, i.e., repeated activity, you get a veritable river of data, so you won’t be starved for data. And since you can keep retraining on fresh data, model drift becomes a non-issue.
  • And in those rare cases where your AI model is indeed truly general, you can rent it out, which means your cost center may become a profit center.
