New York Times vs Microsoft/OpenAI — Quick Digest
I have read the lawsuit document so you don’t have to. Here are the things you should know:
The New York Times sued OpenAI and Microsoft in an interesting lawsuit filed yesterday.
NYT has also demanded a jury trial in this case.
1. NYT has been around for more than 170 years supplying trustworthy information.
2. NYT alleges that OAI's LLMs were built by copying and using millions of NYT articles. These LLMs power MS Bing Chat and OAI’s ChatGPT.
3. Copyright law protects the NYT’s original journalism.
4. These LLMs reproduce NYT articles verbatim, summarise them closely, or even attribute false information to NYT.
5. OAI increased its valuation to $90B, and MS added about a trillion dollars to its market capitalisation, while using the valuable IP of NYT without paying any subscription, licensing, advertising or affiliate revenue.
6. NYT has negotiated agreements with Meta, Apple and Google for their respective news products, but it was not able to reach a similar agreement with OAI/MS for their LLMs.
7. MS/OAI believe that their use of NYT articles is protected under “fair use” as it is for a new “transformative” purpose. However, NYT contests this as it believes that the outputs of these models closely mimic the NYT article inputs and thus they’re not “transformative”.
8. NYT seeks to hold MS/OAI responsible for billions of dollars in statutory & actual damages.
9. NYT employs 5,800 full-time-equivalent employees who contribute to the quality journalism that NYT offers. This includes investigative reporting, breaking news, reviews, opinions etc.
10. These GenAI products threaten high-quality journalism. While NYT permits search engines to surface its content in traditional search results, it has never given permission for its content to be used for GenAI purposes.
11. OAI raised money from wealthy individuals promising altruism but has since moved to being a for-profit organisation.
12. While OAI open-sourced the design and secrets of GPT-1 & 2, it never open-sourced the more powerful versions 3.5 & 4. It justified this on commercial and competitive grounds.
13. MS helped OAI commit mass copyright infringement by being its sole compute provider. It operated a system with 285K CPU cores, 10K GPUs & 400 gigabits/sec of network connectivity for each GPU server.
14. MS combined OAI's models with Bing to create Bing Chat, which mimics NYT's content in its responses, so users no longer need to visit the NYT website.
15. GPT-4 has approx. 1.76 trillion parameters.
16. LLMs exhibit “memorisation”. Given the right prompt, they’ll repeat large portions of the materials they were trained on.
17. For GPT-2, OAI used a dataset called WebText containing the text content of 45 million links. NYT is one of the top 15 domains in it by volume.
18. NYT content (209K unique URLs in total) accounts for 1.23% of all sources in the WebText2 dataset used for GPT-3. WebText2 has a weight of 22% in the training mix for GPT-3.
19. The most highly weighted dataset, Common Crawl, has a 60% weight in the training mix for GPT-3. NYT accounts for 100 million tokens in it and is the third most represented source, just behind US Patents and Wikipedia.
20. OAI admits that “higher quality datasets” are sampled more frequently, resulting in NYT articles being sampled more frequently than other sources.
21. NYT highlights many examples, with screenshots, where ChatGPT responded with a significant portion of the relevant NYT article, essentially allowing users to bypass its paywall. This is true for Bing Chat as well.
22. NYT’s Wirecutter recommendations are also distorted by ChatGPT in its responses, which attributes misinformation to NYT and damages its brand. This is caused by “hallucinations”. NYT highlights multiple such prompts with screenshots of responses from ChatGPT/Browse with Bing/Bing Chat.
23. Use of NYT content without permission helped ChatGPT garner attention from users and increase its revenue significantly. Similarly, Bing crossed 100 million DAU for the first time in its history. MS is integrating OAI's models into its products and charging a subscription fee for them. OAI is essentially distributing NYT’s paid content for free.
FINAL ASK: PAY FOR DAMAGES, OR DESTRUCTION OF GPT MODELS THAT USE NYT CONTENTS.
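Side note on points 18 to 20: the training-mix weights are easier to picture with a small simulation. The Python sketch below is only an illustration, not OAI's actual pipeline. The 60% and 22% weights come from the points above, the "other" bucket is simply the remainder, and the function names are made up for this example. The last function treats NYT's 1.23% share of WebText2 sources as if it were its share of sampled text, which is a rough assumption.

```python
import random

# Mix weights quoted in the digest. "other" is just the remainder
# (1 - 0.60 - 0.22), an assumption made for this illustration.
MIX_WEIGHTS = {"common_crawl": 0.60, "webtext2": 0.22, "other": 0.18}


def sample_sources(n_draws, weights=MIX_WEIGHTS, seed=0):
    """Draw n_draws training examples by dataset weight and
    return the observed fraction of draws per dataset."""
    rng = random.Random(seed)
    names = list(weights)
    draws = rng.choices(names, weights=[weights[k] for k in names], k=n_draws)
    return {name: draws.count(name) / n_draws for name in names}


def rough_nyt_share(dataset_weight, nyt_fraction_of_dataset):
    """Back-of-envelope share of all draws that land on NYT content,
    assuming NYT's share of sources equals its share of sampled text."""
    return dataset_weight * nyt_fraction_of_dataset


if __name__ == "__main__":
    print(sample_sources(100_000))
    # 22% weight x 1.23% of sources: roughly 0.27% of all draws
    print(f"{rough_nyt_share(0.22, 0.0123):.4%}")
```

The point of the sketch is that a dataset's weight in the mix, not its raw size, decides how often its contents are seen during training, which is why point 20 matters to NYT's argument.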