The New York Times points out that so-called “shadow libraries,” like Library Genesis, Z-Library or Bibliotik, “are obscure repositories storing millions of titles, in many cases without permission — and are often used as A.I. training data.”
A.I. companies have acknowledged in research papers that they rely on shadow libraries. OpenAI’s GPT-1 was trained on BookCorpus, which has over 7,000 unpublished titles scraped from the self-publishing platform Smashwords. To train GPT-3, OpenAI said that about 16 percent of the data it used came from two “internet-based books corpora” that it called “Books1” and “Books2.” According to a lawsuit by the comedian Sarah Silverman and two other authors against OpenAI, Books2 is most likely a “flagrantly illegal” shadow library.
These sites have been under scrutiny for some time. The Authors Guild, which organized the authors’ open letter to tech executives, cited studies in 2016 and 2017 that suggested text piracy depressed legitimate