THE MODELS ARE YOURS: THE PUBLIC'S LEVERAGE IN AI

Matt Prewitt

January 8, 2024

On December 27, the New York Times filed a lawsuit, claiming that Microsoft and OpenAI infringed the Times’ copyrights by using its writings to train GPT-4 and other AI models. It follows a series of similar lawsuits from the Authors Guild, writers Michael Chabon and Sarah Silverman, and more.

These cases are asking one monumental question: whether training an AI model on copyrighted material violates the copyright. If training on copyrighted works does infringe, then everyone with a digital footprint has probably already had their rights violated, and will enjoy significant leverage over the future of this transformative technology.

WHAT’S AT STAKE IN THE NEW COPYRIGHT BATTLES

It is a starkly binary legal question: yes or no. Training AI models on copyrighted material either infringes the copyright or doesn’t. And the two possible outcomes point toward very different future worlds.

What would it mean if the AI companies’ lawyers prevail, and courts find that AI models may freely ingest copyright-protected material? Having been trained on authors’ work, AI models will sooner or later be able to do almost exactly what all those authors do, and more, millions of times faster. So AI models’ owners, and not authors, will own most of the fruits of creative labor.

At that point things could get very weird. Disenfranchised authors might start sharing their work only in the shadows by forbidding recordings, banning reviews, and so on. Luddite subcultures might form around efforts to keep creative work off the record and out of the systems.

On the other hand, suppose that courts find that the AI companies have violated authors’ rights. Their potential liability, civil and possibly even criminal, could be as unprecedented as the technology itself. This is because copyright provides for steep statutory damages: a minimum of $750 damages per copyright violation.[1] Given the unimaginable reams of copyrighted works that have presumably already been incorporated into systems like GPT-4 and Claude, you don’t even need to do the math. AI companies, even gigantic ones, could be bankrupted by the damages they owe to—well, all of us.
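
To get a feel for the scale, here is a rough back-of-envelope sketch. The $750 figure is the statutory minimum per infringed work; the corpus size is a purely hypothetical assumption chosen only for illustration.

```python
# Illustrative only: $750 is the statutory minimum per infringed work
# (17 U.S.C. § 504(c)); the number of works is a hypothetical assumption.
STATUTORY_MINIMUM_PER_WORK = 750              # dollars per infringed work
hypothetical_infringed_works = 100_000_000    # assumed count, not a real figure

minimum_exposure = STATUTORY_MINIMUM_PER_WORK * hypothetical_infringed_works
print(f"${minimum_exposure:,}")               # -> $75,000,000,000
```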

Of course, destroying the companies is not what most stakeholders want. But the public and the government should appreciate just how much leverage they might have to achieve public-interested outcomes, like a grand settlement resulting in some kind of public governance rights or equity stake.

Leaving aside what companies might already owe, if copyrights are infringed by AI training, the future simply looks different. Content creators, including ordinary people producing copyrightable digital footprints (students, employees, social media users, etc.), could have huge leverage over the future of the technology. If they organize and bargain collectively (instead of getting “picked off” by individual agreements) they will hold the strings to datasets that are necessary ingredients to the world’s most powerful AIs. The public will have a seat at the table.

Europe has already given us one sketchy glimpse of what that might look like. Drafts of the EU’s AI Act, now jeopardized by stalled negotiations, have suggested the bloc may give copyright holders the ability to programmatically “opt out” of their works’ use in AI training. The artists Holly Herndon and Mat Dryhurst have already set up an organization, Spawning, through which many artists have done just that. It could be a sign of things to come, and the EU’s regulations are an important factor in this conversation.

Another possibility must be noted. If it becomes clear that AI cannot be lawfully trained on all publicly available information, it could create an opening for actors beyond the reach of the law. Given the possible military applications of the technology, state actors will not want that to happen. This would nudge the state security apparatus even further into the AI business.

WHY TRAINING ON COPYRIGHTED MATERIAL IS INFRINGEMENT: IT’S LIKE PLAYER PIANOS

With all those considerations lurking, how will courts resolve the key question?[2] Namely: under US law, does training AI on copyrighted materials constitute infringement, or is it fair use?

Courts look at four factors to determine whether a use of copyrighted material is excused as “fair use”. They are:

  1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
  2. the nature of the copyrighted work, that is, whether it is more “expressive” or factual in nature;
  3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  4. the effect of the use upon the potential market for or value of the copyrighted work.

First, some relevant facts. Large Language Models (the technology underpinning the commercially available generative AI products) can be thought of as a type of data compression (researchers have even experimentally shown that LLMs can be used to compress audio files, much like MP3s).[3] When a model trains on a copyrighted text, it stores information about the text in the form of statistical weights relating “tokens” (words, letters, or phrases) to one another. These weights embody information about the statistical relationships between those tokens in the text. This information takes the form of numbers representing the probability that, say, word C will appear after word B if word B is preceded by word A. This data is not stored in silos corresponding to particular texts; instead, the model as a whole simply uses the information from each text to modify the information culled from all the other texts it trained on. Humans cannot directly make sense of this statistical data, but they can, with a little effort, use it to reconstruct something very close to the original. The providers of these models install secondary safeguards designed to make this kind of exact reconstruction more difficult; but the fundamental capacity to perform such reconstructions is latent in the technology.
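
To make the mechanism concrete, here is a minimal sketch: a deliberately crude bigram counter rather than a neural network, showing how “training” reduces a text to statistics about which token tends to follow which. The text and function names are illustrative assumptions, not anyone’s actual pipeline.

```python
from collections import Counter, defaultdict

def train_bigram_model(text: str) -> dict:
    """Reduce a text to token-transition statistics.

    Real LLMs learn dense numerical weights over vast corpora; this toy
    bigram counter only illustrates the core idea that "training" stores
    information about which token tends to follow which.
    """
    tokens = text.split()
    transitions = defaultdict(Counter)
    for current_token, next_token in zip(tokens, tokens[1:]):
        transitions[current_token][next_token] += 1
    return transitions

# The model no longer stores the text itself, only statistics derived from it.
model = train_bigram_model("to be or not to be that is the question")
print(model["to"])   # Counter({'be': 2})
print(model["be"])   # Counter({'or': 1, 'that': 1})
```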

A trained model thus contains information amounting to a “lossy” compression of all of the copyrighted input. This is significant because other forms of “lossy” information compression are clearly copies. For example, an MP3 file compresses the information in a master tape, discarding much of the original recording data. And a human cannot read the binary code of an MP3 file and recognize it as a song. But MP3s are obviously not “fair uses” of recordings. The compressed file can be decompressed, or played back, in a form that sounds similar to the master tapes, a fact sufficient to prove that MP3s are copies.
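
For comparison, here is a toy sketch of lossy compression, far cruder than MP3’s actual psychoacoustic coding, showing how an artifact that discards most of the original data can still be “played back” as something recognizably close to it.

```python
def compress(samples: list, keep_every: int = 4) -> list:
    """Lossy 'compression': keep only every Nth sample, discard the rest."""
    return samples[::keep_every]

def decompress(kept: list, keep_every: int = 4) -> list:
    """Reconstruct an approximation by holding each kept sample."""
    return [value for value in kept for _ in range(keep_every)]

original = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
compressed = compress(original)         # [0.0, 0.4], most of the data is gone
approximation = decompress(compressed)  # [0.0, 0.0, 0.0, 0.0, 0.4, 0.4, 0.4, 0.4]
# The reconstruction is not the original, but it is recognizably derived from it,
# which is why lossy formats like MP3 are still treated as copies.
```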

Returning to the fair use factors now. The third factor, “substantiality”, clearly weighs against the AI companies because whole unaltered texts are fed into the models. The second factor, “expressivity”, does too, since all sorts of works are used in training, including paradigmatically expressive ones.

The fourth factor looks scarcely better for the AI companies. Their models have obvious potential to harm the market for original works. Users can consult models trained on authors’ works to obtain not only information about those works’ contents, but also a rich experience of their style and character. In many cases, consulting a model might be more efficient and satisfying than consuming the source material directly. This can and does constitute a reason not to buy the book, or subscribe to the magazine, or (soon) watch the movie: core expressive aspects of works can be substantially appreciated through the models alone. Search engines headed off a similar copyright issue over publishing excerpts of news articles by arguing that search was driving traffic and revenue to the original authors. But in the case of generative AI, the competitiveness with the original is clearer, and it would be surprising if any court were persuaded that the authors’ works were not being in some important sense superseded.

Now to the first and most important factor. Obviously the uses of models are commercial; but are they transformative in character? This is the argument AI companies will likely end up relying upon.

On the surface, AI models may seem to have transformed expressive training material into bleep-bloop numerical arrays inside an AI model. But these numerical arrays are compressed copies of original works, capable of being transformed right back into works of a near-identical character, just as MP3s are copies of master tapes.
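
Continuing the toy bigram sketch from above: even that crude statistical “compression” can be walked forward token by token to regenerate text close to what it was trained on, which is the sense in which the arrays remain copies. Again, this is a simplified illustration, not how production models actually decode.

```python
from collections import Counter, defaultdict

def train_bigram_model(text: str) -> dict:
    """Same toy 'compression' as in the earlier sketch: count token transitions."""
    transitions = defaultdict(Counter)
    tokens = text.split()
    for current_token, next_token in zip(tokens, tokens[1:]):
        transitions[current_token][next_token] += 1
    return transitions

def generate(model: dict, start: str, length: int = 9) -> str:
    """Greedily walk the stored statistics to regenerate text from them."""
    tokens = [start]
    for _ in range(length):
        followers = model.get(tokens[-1])
        if not followers:
            break
        tokens.append(followers.most_common(1)[0][0])  # most likely next token
    return " ".join(tokens)

model = train_bigram_model("to be or not to be that is the question")
print(generate(model, "to"))  # -> "to be or not to be or not to be"
```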

Courts should not be confused by the fact that AI companies package their models with secondary safeguards designed to frustrate exact reconstructions of the source material. The outputs in the chatbox are pastiches, usually “transformative” ones, but those pastiches are not the relevant “copies”. The “copies” are the models themselves: the extraordinarily powerful pastiche-generators capable of rendering outputs that supersede their source material, whose power depends on containing compressed copies of that source material.

New information compression technologies have always affected the nature of whatever they compress. For example, when music recording was invented, music itself changed. In the early 1900s, player piano rolls and phonograph recordings were not legally recognized as unlawful “copies” of protected musical writings. A composer’s work consisted only of musical notation and lyrics; only the sheet music was subject to rights related to reproduction. The Supreme Court affirmed as much in the 1908 case of White-Smith Music Publishing Co. v. Apollo Co. But even at the time, that didn’t make sense: the case is widely remembered as a judicial misfire. And the copyright law’s period of adjustment to new technology was mercifully brief. Recognizing that piano rolls and audio recording had changed the nature of musical expression, Congress responded, passing the Copyright Act of 1909, which gave musical authors rights in recordings, so-called mechanical rights.

Generative AI is actually very much like player pianos—even down to the eerie, mistaken attribution of disembodied agency. Just as in 1908 recordings were emerging as the predominant artifacts of musical production, generative AI outputs are now emerging as the definitive artifacts of all recorded human expression. This shift will intensify rapidly.

Is our legal and political system still capable of rapidly responding with laws that guarantee authors (even nonprofessional ones) a seat at the table?

A NEW STRATEGY—A NEW COPYRIGHT LAW

As I said, the courts will find either that AI training violates copyrights, or that it doesn’t. Either way, a radically new copyright doctrine will need to be worked out legislatively. But it would be an auspicious start for courts to find against the AI companies now—first, because this is the best interpretation of the current law, and second, because it will rightfully strengthen authors’ bargaining position in any subsequent political settlement.

Across society, we should be organizing to meet the moment and guide our politicians. The tech companies’ lawyers certainly are; the rest of us can’t afford to be years behind them. How should we organize?

First, coalition-building. SAG-AFTRA and the Writers’ Guild have brought this issue to national attention; they should not be fighting alone. Where are the school systems, the universities, the religious organizations, the podcasters, the political movements, and others who have great influence and stake in important and protected data? They should be joining forces, collaborating with Spawning and others. This isn’t a partisan cause, and it isn’t anti-AI; it’s a simple matter of public empowerment.

Second, lawyers, academics, and technologists need to come together to debate and draft the legal resettlement we need. What are the deep principles and common values that we want intellectual property law to protect? Have we, perhaps, been underestimating the diffuse social contributions to “individual” intellectual work for some time; and can we devise a sensible way for the law to now correct this error? Can automatically-created mechanical licenses to musical recordings serve as a template for a new regime that gives everyone a stake in the AI models based upon their work?

The tech lobbyists will surely hand finished text to our representatives. Where is the countervailing proposal?

Special thanks to Lucas Geiger for helpful comments and edits.

Notes

  1. Maximum per-violation damages are capped at $30,000, or $150,000 if the infringement was willful. The latter is not out of the question. Willful violations need not be “knowing”; they can be merely “reckless”. And the AI companies have taken many actions indicating that they knew they might be violating copyrights, such as falsely claiming that they did not train on copyrighted material. This is evidence of recklessness.
  2. There have been some early setbacks for plaintiffs in these lawsuits, but these are mostly procedural; it is much too early to say that the AI companies will defeat the claims I am focusing on here.
  3. Language Modeling Is Compression, https://huggingface.co/papers/2309.10668