

If ChatGPT were a person, they’d be well-versed in the classic novels you probably read in high school. But there’s also a good chance they’d watch Doctor Who and know their way around a Dungeons and Dragons game.

In the recently published Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4, I School Associate Professor David Bamman reveals much about what is known, and what remains to be known, about the large language model (LLM) fueling ChatGPT. The paper was co-authored by I School Ph.D. student Kent Chang, postdoc Sandeep Soni, and undergraduate research apprentice Mackenzie Cramer.

Because the datasets that LLMs like ChatGPT and GPT-4 are trained on aren’t made publicly available, Bamman and his team had to first figure out which books were used in the training data. Bamman et al. conducted a data archaeology by asking the platforms to complete a fill-in-the-blank exercise using different sources:

- 90 Pulitzer Prize nominees from 1924-2020
- 95 bestsellers from the New York Times and Publishers Weekly from 1924-2020
- 101 novels written by Black authors, drawn from the Black Book Interactive Project or from Black Caucus American Library Association award winners from 1928-2018
- 95 works of Global Anglophone fiction (outside the U.S.)
- 99 works of genre fiction: sci-fi/fantasy, horror, mystery/crime, romance, and action/spy

It worked like this: they’d enter a passage from a novel into the bot with a named character blocked out, then ask the bot to fill in the missing name. They did this 100 times for each piece of literature selected, and each book was then given a score based on how many times the bot answered correctly. The higher the score, the more likely it was that the book had been memorized (see Table 2).
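As a rough illustration of that fill-in-the-blank scoring, here is a minimal Python sketch. The query_model function is a hypothetical stand-in for an API call to ChatGPT or GPT-4, and the prompt wording, the [MASK] placeholder, and the exact-match check are assumptions for the sketch, not the paper’s exact protocol.

```python
import re
from typing import Callable, List

def name_cloze_score(passages: List[str], character: str,
                     query_model: Callable[[str], str]) -> float:
    """Estimate memorization for one book: mask a character's name in each
    sampled passage, ask the model to fill it in, and return the fraction
    of passages where the model recovers the name exactly."""
    correct = 0
    for passage in passages:
        # Blank out the named character; "[MASK]" is an assumed placeholder,
        # any distinctive token would do.
        masked = re.sub(re.escape(character), "[MASK]", passage)
        prompt = (
            "Fill in the [MASK] in the passage below with the proper name "
            "of a single character. Reply with the name only.\n\n"
            f"{masked}"
        )
        # query_model is a hypothetical stand-in for a chatbot API call.
        guess = query_model(prompt).strip()
        if guess.lower() == character.lower():
            correct += 1
    return correct / len(passages) if passages else 0.0
```

Run over 100 sampled passages per book, as in the study, the returned fraction plays the role of the per-book score: the higher it is, the more likely the book appeared in the training data.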

Maybe it’s unsurprising that Bamman found the models were trained on materials widely available on the web. These fall into two broad categories: books in the public domain (pre-1923 LitBank titles) like Alice’s Adventures in Wonderland or The Scarlet Letter, available online thanks to Project Gutenberg, which digitizes works of literature; and copyrighted material, such as sci-fi novels like Fahrenheit 451 or George R.R. Martin’s fantasy Game of Thrones: titles with particular nerd appeal that netizens have quoted at length over the lifetime of the web.

The bot doesn’t fare as well with Pulitzer winners, works of Global Anglophone fiction, or novels from the Black Book Interactive Project and Black Caucus American Library Association award winners. Bamman’s experiment raises some important questions to consider going forward.
