What OpenAI's deal with the FT means for accessing information online
And how chatbots are diverging from search engines
The Financial Times and AI company OpenAI have announced a licensing deal that will see the paper’s content used to train tools including ChatGPT, which follows similar deals struck between OpenAI and publishers like Axel Springer and Le Monde. The deal allows OpenAI to “respond to questions with short summaries from FT articles, with links back to FT.com”, according to the FT’s own write-up (which presumably will itself be in OpenAI’s corpus before long). The deal creates a stark contrast with the ongoing legal battle between OpenAI and the New York Times, whose lawsuit alleges that OpenAI has sought “to free-ride on the Times’s massive investment in its journalism” by hoovering up copyrighted content from the open web.
In some ways this saga is a sped-up version of the long-running battle between Google and news organisations over the profit the search engine makes from indexing news stories to include in its search results. Google has at different times both pushed back on and acceded to demands to pay news providers for including their stories in search results. (It is notable in this context that Google has yet to strike any such deals with publishers for content in its ChatGPT rival, Gemini.)
But generative AI is a different beast. The nature of search engine results—plural—implies a variety of sources of information from which a user can choose (even though organic results are increasingly being squeezed out by ads, and though AI is increasingly used to “summarise” results from third-party sites). The “secret sauce” at the heart of Google’s appeal was its search ranking algorithm, which surfaced what it took to be the most relevant response to a user’s query. Notably, the original Google ranking algorithm drew inspiration from academic citation networks: a page is ranked highly if it is linked to by other highly ranked pages, much as a paper gains standing from citations. ChatGPT, by contrast, aims to provide a clear-cut answer to a prompt, and will gladly provide as much information as a user asks for—without citing, let alone linking to, the sources it uses.
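The citation-network idea behind Google’s original ranking can be sketched as a toy PageRank-style computation. (The graph, damping factor, and function name below are illustrative assumptions for the sketch, not details from this article or Google’s actual implementation.)

```python
# Toy PageRank-style ranking over a tiny link graph.
# The link graph and damping factor (0.85) are illustrative assumptions.

def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by iterating the 'random surfer' update.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share of rank...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # ...and passes the rest along its outgoing links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(graph)
# "C" is the most-linked-to page, so it ends up ranked highest.
print(max(ranks, key=ranks.get))
```

The point of the analogy: like citation counts weighted by the prestige of the citing journal, rank flows along links, so being referenced by well-referenced pages matters more than raw link counts.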
Consider a recent exchange. I asked ChatGPT 3.5 to “tell me about the ‘miracle on the Hudson,’” and received a four-paragraph response. Next, it happily obliged when I prompted it to turn this response into a 1,500-word essay (it subtitled it “A Tale of Courage and Survival”). Finally, I asked ChatGPT about the sources it used for the essay, and it told me that “I don't have direct sources to cite, but the information provided is based on widely documented accounts and historical records”. It encouraged me to consult “official reports, news articles from reputable media outlets, books, … and documentaries that chronicle the incident” without naming any (the only specific source it cited was the Tom Hanks film Sully).
ChatGPT, then, is built on both secret sauce (its immensely powerful general-purpose model) and secret sources: it is generally unwilling to reveal the specific sources of information used in a response. This is likely for both technical and legal reasons: explaining why a large language model came to a particular conclusion or drew on a particular source (or how many sources it drew upon) is technically very challenging. And given the scrutiny of OpenAI’s use of copyrighted material, it would not be in the company’s interest to voluntarily reveal precisely which sources are used, even if this were technically straightforward.
This makes OpenAI’s strategy of pursuing bespoke deals with publishers all the more intriguing. A divide-and-conquer strategy makes sense for OpenAI on a corporate level: establishing deals with some media companies will weaken the media industry’s overall resolve to push back on perceived copyright violations. And publishers who are already benefiting from such deals, such as Axel Springer, have favourably compared this new revenue stream to decades-long, fruitless efforts to seek compensation from Google and Meta. (Little has been said, in contrast, about how much of this revenue stream journalists themselves will see.)
But on a product level, if ChatGPT is willing to reveal and even link to some sources, like the FT, but not others, like the NYT, this could create a two-tier approach to attribution[1] in which OpenAI privileges certain publishers over others, driving traffic in some but not all directions. The legal and economic ramifications of this are still unclear, but deals such as this one underline the powerful position that AI model providers like OpenAI increasingly occupy at the apex of important gateways to information.
[1] This is in addition to the multi-tier pricing system OpenAI has in place for users.