According to filings in Kadrey v. Meta, company employees referred to LibGen as a “data set we know to be pirated,” and flagged that its use “may undermine [Meta’s] negotiating position with regulators.”
Some decision-makers within Meta apparently believed that failing to use LibGen for model training could seriously hurt Meta’s competitiveness in the AI race, calling LibGen “essential to meet SOTA numbers across all categories,” a reference to matching the best, state-of-the-art (SOTA) AI models across benchmark categories.
The filing also cites a memo to Meta AI decision-makers noting that after “escalation to MZ,” Meta’s AI team “[was] approved to use LibGen.” (MZ, here, is rather obvious shorthand for “Mark Zuckerberg.”)
An email outlined “mitigations” to reduce Meta’s legal exposure, including combing through LibGen files for words like “stolen” or “pirated” and simply not citing its use publicly: “We would not disclose use of Libgen datasets used to train.”
Earlier court filings indicated that Meta considered buying the publisher Simon & Schuster in order to use published books to train its AI models, but Meta execs determined it would take too long to negotiate licenses and reasoned that fair use was a solid defense.
New filings this week show portions of internal work chats between Meta staffers and paint the clearest picture yet of how Meta may have come to use copyrighted data to train its AI.
“[M]y opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call,” wrote Xavier Martinet, a Meta research engineer, in a chat dated February 2023, according to the filings. “[T]his is why they set up this gen ai org for [sic]: so we can be less risk averse.”
After another staffer pointed out that using unauthorized, copyrighted materials might be grounds for a legal challenge, Martinet doubled down, arguing that “a gazillion” startups were probably already using pirated books for training.
This summer, the court is expected to decide whether Meta broke copyright laws. If so, the authors whose books were used will be officially certified as a class in the suit.
The case is one of many AI copyright disputes slowly winding through the U.S. court system.
For more on the AI race for data and how Google may already have used yours to train AI, see this article in the New York Times.
Justine Bateman, a filmmaker, former actress and author of two books, told the Copyright Office that A.I. models were taking content — including her writing and films — without
permission or payment. “This is the largest theft in the United States, period,” she said in an interview.