S&P Global Inc. (SPGI) Earnings Call Transcript & Summary

March 6, 2025

New York Stock Exchange US Financials Capital Markets conference_presentation 57 min

Earnings Call Speaker Segments

Daniel Sandberg

executive
#1

Okay. Good morning, good afternoon, good evening to those joining us. My name is Dan Sandberg, and I head Quant Research for S&P Global Market Intelligence. We have an incredible webinar to share with you today: Questioning the Answers: LLMs enter the Boardroom. In this webinar, we're going to take a look at how large language models can be used to score executive performance on earnings calls to help investors make data-informed decisions. First off, as your moderator, I have a few housekeeping reminders. This webinar features closed captioning in English. To activate, simply click the closed caption icon in the media player. At the conclusion of the session, a brief survey will appear. Completing it takes less than a minute and your feedback is invaluable to us. We want this to be an interactive session, and we encourage you to submit your questions throughout the presentation. To ask a question, click the Q&A button at the bottom of the screen. All engagement tools are resizable and movable. If you're joining us for the replay, please use the Request Demo link found under the Related Content widget to reach out to us. This widget also includes links to white papers and other relevant collateral. You can also access our webinar replay portal to revisit this session and others on demand. With that now out of the way, we're ready to dive in. We have an all-star lineup joining us today. Liam Hynes, our Global Head of New Product Development for Quant Research will be walking through the team's work; Henry Chiang, Quant Analyst, is going to walk us through the code for this research; and Ronen Feldman, Founder of ProntoNLP, which is now an S&P Global Market Intelligence company, will discuss how Pronto and this research come together. Liam, take us away.

Liam Hynes

executive
#2

Great. Thank you, Dan. Okay, we're going to jump right into it. There's a lot of content to cover today. So first of all, we're going to start off with a little bit of a history lesson. On October 16, 2001, Enron held their third quarter 2001 earnings call. In the Q&A section of that call, one of the analysts asked, "How confident can we be that these will be the last write-offs?" Kenneth Lay, Enron's CEO, responded, "If we thought we had any other impaired assets, they would be on this list today. But we do still have at least 3 areas of uncertainty in the company, which you're aware of. Of course, one's California. We've got India. And then, of course, finally, broadband." Was Kenneth Lay reactive or proactive on this write-off topic in his pre-prepared remarks? Analysts posed 6 questions on write-offs, and write-offs were not mentioned once in the pre-prepared remarks. That meant that Kenneth Lay was being reactive rather than proactive on the write-off topic. Did Kenneth Lay remain on topic to the analyst's question? No, he did not. California, India, and broadband do not equal write-offs. Kenneth Lay was pivoting and totally off topic to the question asked. I think folks probably know the rest of the story. Enron collapsed in December 2001, wiping around $74 billion off its shareholder value, costing thousands of employees their jobs and retirement savings. Lay was actually indicted in 2004 on charges including fraud and conspiracy. And in May 2006, he was convicted on multiple counts. This exchange with the analyst was actually evidence in the case the Department of Justice brought against Lay as prosecutors argued that Lay's statement was intentionally misleading as he failed to disclose Enron's true financial troubles. So that gives us a nice segue into the hypotheses that we want to test.
The first hypothesis is that firms that remain on topic when answering questions during earnings call Q&A have superior stock performance compared to those that pivot to adjacent or unrelated topics. And the second is that firms that proactively address key issues in their pre-prepared remarks before analysts ask about them in earnings call Q&A have superior stock performance compared to those that respond reactively. So what we're going to look at today, the dataset that this analysis was stood up on top of is S&P Global Market Intelligence machine-readable transcripts. The time period we're looking at is January 2008 to September 2024, and we're looking at the Russell 3000. Okay. So let's have a look at an example here. The first thing we need to do is identify the question-and-answer pairs in the earnings call and process them. In this example, we show Caterpillar's second quarter '23 earnings call and a question around adjusted margin on lower sales. What we do is we push this question to Snowflake's Cortex LLM summarization function. And what that does is it gives us a summarized question: "Given 2023 performance, should margins be adjusted on lower sales? And is this due to pricing?" Okay. Well, why do we do that? Well, summarization does 4 things: it does noise reduction; it improves semantic matching; there's standardization and comparability; and then lastly, there's some computational efficiency to be gained by doing that. So let's look at the question-and-answer pairs. What do we actually do here? So first of all, we start off with the original question and answer in their long form, then we summarize those questions and answers. And then what we do is we use large language model embeddings. We vector embed the text. So the LLM vector embeddings represent text as a numerical vector. And what this does is it preserves the semantic meaning for efficient processing downstream.
And by vector embedding the text, we can now identify whether the answer to the question is on or off topic by calculating the cosine similarity score between the question and answer vectors. So a high cosine score indicates that the answer is using concepts and language similar to the question, i.e., it's on topic, and a low cosine score equates to the answer being off topic. We do this for every question-and-answer pair in the earnings call, and then we get an average cosine score for the entire call. So what do we do downstream once we've done this feature generation? So we want to see if on-topic executives outperform their off-topic peers. We've just run through the feature engineering. Next, what we'll do is we'll form portfolios. So what we do is by taking the average cosine score, we resample it to the end of the month. We rank the companies in each sector. And then what we do is we go long the top 20% of companies with a high cosine score. And remember, that's a high on-topic score, and then we short the companies with a low cosine score or companies who are off topic. We then look at the 1-month forward returns, which we have Fama-French adjusted for market, value, size and momentum. And then we've also adjusted the forward returns for some natural language processing signals: sentiment, language complexity and numeric transparency. So here are the results for the Russell baskets. So you can see here on the long side, the Russell 3000 generates statistically significant alpha of 190 basis points, and that's with a hit rate of 61%. And the smaller-cap Russell 2000 generates 200 basis points with a hit rate of 62%. On the long/short side, the Russell 3000 generates 390 basis points of alpha with a 63% hit rate, and the Russell 2000 generates 450 basis points, again with a very strong hit rate of 63%. So pure alpha signal from identifying on- or off-topic executives on an earnings call.
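[Editor's illustration] The on-topic score described above can be sketched in a few lines of Python. This is a minimal sketch, not the team's code: the function names are hypothetical, the three-dimensional vectors are toy stand-ins for real LLM embeddings, and returning 0.0 for a zero-length vector is a convention assumed here.

```python
import math
from statistics import mean

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero embedding
    return dot / (norm_u * norm_v)

def call_on_topic_score(qa_embeddings):
    """Average cosine score over all (question, answer) embedding pairs in a call."""
    return mean(cosine_similarity(q, a) for q, a in qa_embeddings)

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # answer points the same way as the question
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # answer is orthogonal to the question
]
score = call_on_topic_score(pairs)  # average of the two pair scores
```

Averaging at the call level gives one number per company per call, which is what makes the later monthly ranking possible.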
So now we're going to move on to how we calculated the proactive and reactive score. For this, we used prompt engineering to have an LLM pretend it was an executive on an earnings call and to answer the questions. So the context that we gave the LLM executive to answer the analyst question was just the pre-prepared remarks and any preceding answers to the previous questions. So for example, if there were 20 questions on the call, for the 20th question, we gave the LLM executive the pre-prepared remarks plus the answers to the preceding 19 questions for it to answer. And then we do the exact same process as we did previously. We take those features and we form the portfolios going long the top 20% of proactive companies, shorting the bottom 20% of reactive companies, and we'll see the alpha results here. So again, long/short portfolio. On the long side, the Russell 3000 generates statistically significant alpha of 74 basis points, hit rate of 53%. And the smaller-cap Russell 2000 generates just under 100 basis points there, 96 basis points with a hit rate of 61%. And on the long/short side, the Russell 3000, 173 basis points, 55% hit rate. And then the smaller-cap Russell 2000, there's more alpha there, 240 basis points with a hit rate of 57%. So again, a pure alpha signal from just identifying proactive and reactive executives on an earnings call. So just going to jump into now a couple of examples. Going back to the Caterpillar example. One of the questions from the earnings call was, the backlog increased by $300 million. Was this due to pricing or an increase in order volumes? Well, higher sales and order volumes were actually covered in the pre-prepared remarks by Andrew Bonfield, the CFO. And when we posed that question to the LLM executive, the answer was the backlog increase was due to a mix of higher prices and increased order volumes resulting from new orders and dealer inventory changes.
That came out with a cosine score of 0.89, putting it in the 87th percentile, and this indicated that the executive was being proactive. So what was the actual original executive answer? The original executive answer was price had significant impact while volumes fluctuate quarter-by-quarter. Backlog remains healthy. And how did that executive score? 82nd percentile and the executive was answering the question and on topic. So we look at how did Caterpillar perform 2 weeks post that earnings call, where the S&P 500 was down 2.16%? And Caterpillar actually outperformed. It returned 7.3% 2 weeks post the call, so a 9% active return above the S&P 500. On to another example. This is taking a question from Golden Ocean Group's first quarter '23 earnings call. So the question was, "Could you talk about the FFA markets? As Capes, large vessels, are in backwardation and smaller vessels are in contango." While FFA markets was not covered in the pre-prepared remarks by the executive, the LLM answer was, "The FFA market is volatile and no predictions are being made about the future." That came in with a low cosine score, 10th percentile, so the executive was being reactive, i.e., this topic was not covered in the pre-prepared remarks and the executive was reacting to the analyst's question. How did the executive actually answer the question? "Our comments are more in the longer-term perspective, so I won't be able to comment specifically on the FFA curve." I would say that, that's not even going off topic. I think that's changing the topic or just bluntly not even answering the topic. Cosine score is 0.69, 15th percentile. Executive is clearly answering off topic. What happened to Golden Ocean Group's share price 2 weeks post the call? The S&P 500 was up 1% and Golden Ocean Group dropped 17.3% 2 weeks post the call. Okay. So we've identified 2 alpha-producing signals. One is proactiveness and the second is on-topicness. 
But what happens if we create 4 communication styles from those 2 signals? So the first one we looked at is a proactive and on-topic manager. So this is an executive that is giving the analysts everything that they want to know in the pre-prepared remarks. And when the analysts ask a question, the executive is remaining on topic to the question asked. The second is proactive and off topic, so proactive in the pre-prepared remarks but answers the question off topic. The third is reactive and on topic. So even though they didn't cover the topics in the pre-prepared remarks, when analysts do ask about those topics in the Q&A, executives remain on topic to the question asked. And then you've got the entire flip side, you've got reactive and off topic. So these are executives where they haven't given everything that the analysts are looking for in the pre-prepared remarks. And when those analysts go looking for that information with their questions, the executives go off topic. And the performance is quite telling. So you can see here the blue line, the proactive and on-topic executives significantly outperformed their reactive and off-topic peers. This is a backtest that was done over the past 16 years on the Russell 3000. So proactive and on-topic managers generate 247 basis points of alpha per year. On the flip side, their reactive and off-topic counterparts generate negative 256 basis points of alpha per year. So that's a differential of 506 basis points on the long/short side from those 2 communication styles. So what we're looking at here is a table of those returns. So the top-left quadrant that you can see here is a proactive and on-topic manager, and the bottom right quadrant is a reactive and off-topic manager. Essentially, what we did is we did a dependent sort on proactiveness. So we looked at the proactive cosine score. We put that into 3 buckets, into 3 tertiles. And then within each of those tertiles, we tertiled it on the on-topic score. 
And what that does is it gives you 9 distinct portfolios. And what you're looking at here is the return of those portfolios, the t-stat in the brackets and the hit rate percentage below that. So firms with the most on topic and proactive executives outperformed those with the most off topic and reactive executives by more than 5% per year. The spread of on versus off topic is larger when executives are reactive, and the spread of proactive versus reactive is larger when executives are off topic. But an interesting exercise to do is to rebase the returns in the table. So if I'm looking at the top-left proactive and on-topic quadrant here, I can see that 100%. That represents the maximum return that you can get from this strategy. On the bottom-right bucket, the 0% there that you can see represents the 0 return that you can get from this. So the interesting thing is that it is the combination of the 2 evasive behaviors, both reactiveness and off-topic alignment that signals underperformance. So managers are significantly penalized if they're both reactive and coupled with off topic. It makes sense, right? They haven't covered some key topics in the pre-prepared remarks that analysts are looking for. And then when analysts actually go and try and find information from the executive on those topics, they're avoiding the question, going off topic and potentially being evasive.
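[Editor's illustration] The dependent sort and the rebasing step described above can be sketched as follows. This is a hedged sketch with hypothetical names and toy data. It assumes near-equal tertiles cut by rank, and it assumes the rebasing is a min-max rescale in which the weakest portfolio maps to 0% and the strongest to 100%; the presentation does not spell out the exact formula.

```python
def tertile_buckets(items, key):
    """Split items into 3 near-equal buckets, highest key first."""
    ranked = sorted(items, key=key, reverse=True)
    n = len(ranked)
    cuts = [0, n // 3, (2 * n) // 3, n]
    return [ranked[cuts[i]:cuts[i + 1]] for i in range(3)]

def dependent_double_sort(firms):
    """Sort on proactiveness first, then on on-topic score WITHIN each tertile.

    Returns a dict keyed by (proactive_tertile, on_topic_tertile), where
    (0, 0) is most proactive / most on topic and (2, 2) is the opposite corner.
    """
    grid = {}
    for i, bucket in enumerate(tertile_buckets(firms, key=lambda f: f["proactive"])):
        for j, cell in enumerate(tertile_buckets(bucket, key=lambda f: f["on_topic"])):
            grid[(i, j)] = [f["name"] for f in cell]
    return grid

def rebase(returns):
    """Min-max rescale portfolio returns to a 0-100 scale (assumed rebasing rule)."""
    lo, hi = min(returns.values()), max(returns.values())
    if hi == lo:
        raise ValueError("all returns are equal; nothing to rebase")
    return {k: 100.0 * (v - lo) / (hi - lo) for k, v in returns.items()}

# Nine hypothetical firms with made-up scores.
firms = [
    {"name": "A", "proactive": 0.9, "on_topic": 0.5},
    {"name": "B", "proactive": 0.8, "on_topic": 0.9},
    {"name": "C", "proactive": 0.7, "on_topic": 0.7},
    {"name": "D", "proactive": 0.6, "on_topic": 0.6},
    {"name": "E", "proactive": 0.5, "on_topic": 0.4},
    {"name": "F", "proactive": 0.4, "on_topic": 0.8},
    {"name": "G", "proactive": 0.3, "on_topic": 0.3},
    {"name": "H", "proactive": 0.2, "on_topic": 0.2},
    {"name": "I", "proactive": 0.1, "on_topic": 0.1},
]
grid = dependent_double_sort(firms)
rebased = rebase({"best": 3.0, "middle": 0.5, "worst": -2.0})  # toy returns, in %
```

Because the second sort happens inside each proactiveness tertile, every one of the 9 cells is populated even when the two scores are correlated, which an independent sort would not guarantee.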

Daniel Sandberg

executive
#3

Liam, fantastic so far. We're getting a lot of action through our Q&A widget here. I see a question has just come in. So just to clarify on this analysis, I'm going to rephrase the question a bit here. So what you're basically saying is that you could be the most off-topic executive, but as long as you're proactive, you can earn 75 cents on the dollar. And as long as you're on topic, even if you're the most reactive, you can earn 69 cents on the dollar. Is that the way to understand this?

Liam Hynes

executive
#4

Yes, that's -- yes, that's exactly right, Dan. Yes. So you can be entirely -- you can be an executive and be entirely off topic, but as long as you are proactive in the presentation, you're not penalized as much. And the same, you can be an entirely reactive executive on the earnings call, but as long as you're answering the analyst questions and remaining on topic, you're not penalized. It's when both of those characteristics are blended in an executive that they're really penalized. When they're both reactive and off topic, there's a significant deterioration to the returns of that company. Okay. So we've kind of run through -- we've shown empirical results that these 2 behavioral signals are predictive of forward returns. But why? We've kind of experienced the what, but why are they experiencing these returns? Well, when a manager or an executive is in the earnings call and they are answering questions, what does their 1-year forward gross profit growth look like? So reactive and off-topic managers, about 1 year from the earnings call, generate around 12% growth in their gross profit. But proactive and on-topic managers have around 2.5x that growth. They generate 31% growth in their gross profit 1 year after the earnings call. So essentially, what that means is that you have executives that are exuding some confidence because they understand their business very well, and they can probably understand if there are any headwinds or any operational inefficiencies in their organization in the coming 12 months. So their nonevasiveness breeds transparency. They're willing to answer every question and remain on topic, and they're not evading or hiding anything from the audience.

Daniel Sandberg

executive
#5

We've got another good inbound here, Liam. A question, did you look at whether there was any multiple expansion within the company? So just sort of dovetailing on what you were saying there, it looks like firms can either appreciate in value via stock price through multiple expansion or through improving actuals. And so you're showing improving actuals here, which argues that there's real improvement in the firm. How about perception-wise? Did we see any growth in the multiple that the stock trades on for the proactive and on-topic firms?

Liam Hynes

executive
#6

Yes, there was some slight growth in that. The main effect on the multiples basically came from the actuals side. So there is some economic rationale when it comes to that. And thanks, Dan, you probably preempted the next slide I have. But essentially, proactively addressing key issues reduces speculative uncertainty, so it actually leads to a lower risk premium. So the more transparent you are, potentially the lower the risk premium you have. And what that can do is it results in a higher valuation multiple as investors price in stability and there's some predictability around the future earnings growth. So very well-timed question there from the audience. Thank you. And then the other economic rationale that we have is strategic foresight and competitive positioning, right? So firms that proactively address investor concerns, they demonstrate strong risk management and strategic foresight. And what that does is it signals operational strength. Confident firms with durable competitive advantages are more likely to engage in clear direct communication, and then firms in weaker positions may avoid key topics signaling underlying vulnerabilities. So like I mentioned on the previous slide with the gross profit growth, if you're an executive and you know that your 12-month outlook isn't that strong, it puts that executive in a weaker position. They may avoid key topics and they may try and avoid difficult or hard questions from analysts. I'm going to hand it over to Henry Chiang. Henry is going to run through a little bit of a coding tutorial on how we took the machine-readable transcripts and pointed them to the LLM API and came up with these 2 scores, the proactive score and the on-topic score. I'll hand it over to you, Henry there, if you want to share your screen.

Daniel Sandberg

executive
#7

And while Henry is pulling that up, I'll just let the audience know, as when Henry shares his screen, the media player should get larger within the webinar console. So the slides will still be visible on one side, but the demo is going to occur in the media player, and that should resize automatically for you. As a reminder, everything in the webinar console is resizable, movable. So if you need to adjust, now is a good time to do that. Go ahead, Henry.

Henry Chiang

executive
#8

Thanks, Dan. Thanks, Liam. So in my part, I'll be covering how to actually start from the transcript and derive those 2 signals that Liam just mentioned. So for the purpose of today's demo, I'm pulling up Caterpillar's Q2 2023 earnings call transcript in a PDF format. But keep in mind that on the back end, we have 196,000 transcripts in machine-readable format to be processed systematically. So let's have a look at this transcript. So in the transcript, you can have a look at all the participants. So there are 2 executives that attend this earnings call, and there are also analysts from investment banks that jump on to this earnings call and ask the questions. So there are 2 sections in an earnings call. First, they start with the presentation. So this is when the CEO and the CFO jump on to the call, tell everyone how much money they've made in that quarter, and they also talk about other things like risks, outlook, and basically provide details of the finances of the company. And then there's a second section which is the Q&A. So this is a section where the investment banking analysts, they jump on to the call, they have heard the pre-prepared remarks that the executives just gave, and they ask questions around those topics. So you can see here that these are different analysts. And here, the executives take turns to answer those questions. The demo that Liam just gave in the case study was this question that came from Tami Zakaria. She's an equity researcher from JPMorgan, and her question was on the backlog. So backlog increased by $300 million, and she was asking if that's purely driven by pricing or an increase in the order volume. So now I'm jumping on to Snowflake to show you how we process this systematically on the back end. So let me pull up the same transcript. It's the second quarter 2023 and the company is Caterpillar. So in here, you can see this transcript. We are starting from the Q&A pairing table.
So essentially, what it says is that we have everything from the very first question, which is normally just greetings, paired to the answers. So we have all the question-and-answer pairs running from the first to the end. And on average, we have about 20 to 30 of these Q&A pairs in an earnings call transcript. On the back end, we pair all of these together so that we can process them. We can do all the vector embedding, the cosine similarity calculation that Liam just mentioned. And here, I'm going to show you how powerful our machine-readable transcript product is. So it gives you the details of the transcript. So basic information like the call date, so this earnings call was conducted on the 1st of August 2023. It's the second quarter of 2023, and you can also see the headlines. So these are all just basic information of the earnings call, right? Nothing too surprising. But here, I'm going to show you the part that shows how amazing our machine-readable transcript really is. It's about how we can link each of the questions to the person that's asking the question. So we actually have our professionals dataset that allows you to map each of the components from the transcript to the person that's asking the question. So here, you can see that Tami Zakaria was on here, and you can see her question and her Pro ID. So this allows you to map this person to their estimates. So it's a very powerful tool. It allows you to map the questions to the estimates. So you will be able to analyze things like whether in an earnings call, an executive is picking more bearish than bullish analysts to ask the questions, or mapping this to the estimates, which allows you to sort of predict the outcome of the company. And then you can also do that on the answer side. So you'll be able to map that to the person who is answering that question by the Pro ID.
So you know that some of them are coming from the CFO and some of them are coming from the CEO. And this allows you to analyze the language that each person is using in answering these questions. So it opens up tons of NLP analysis work you can do on the transcript. So jumping back to our signals, what we did was that we vector embed both the questions and the answers, and then we can calculate the cosine similarity between them. That's for the first signal, the on-topic signal. And we also have the second signal, which is the proactive and the reactive signal. And now we'll have to generate the LLM response, and here is the prompt that generates all that. So as Liam mentioned, the prompt first starts with: pretend to be a top executive, please answer these questions that came from an analyst. And then we provide it with 60% of the pre-prepared remarks that the executive gave. And then we ask the same question that the analyst asked. So now let's take the example of the question that came from Tami Zakaria. It's the 24th question in the component order, so let's select that. And we can also have the answer that is corresponding to that question. So what's happening on the back end is that the prompt I just showed you, we are giving that to an LLM, and the LLM is generating the LLM response on the fly. So here, you can see that this was the question. It was on the $300 million increase in backlog. And here's the executive answer that we just saw in the PDF. And now this is the LLM answer. So as Liam mentioned, we also summarized all 3 of them for consistency and for standardization. Here is the summary of the question. So indeed, it was talking about $300 million increase. Here's the answer. The executive answered that the increase was due to both pricing and volume. The LLM answered similarly. It's about pricing and the volume. So from here, you can vector embed both the question and the answers and here are your signal scores.
With that, I'm going to hand back to Dan, and he's going to introduce Ronen from Pronto.
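[Editor's illustration] The context handed to the "LLM executive" can be sketched as a small prompt builder. This is a sketch under stated assumptions: the function name, the section labels and the exact instruction wording are hypothetical paraphrases, and the context follows Liam's description (pre-prepared remarks plus the answers to all preceding questions). The model call itself is left out because the specific LLM API is not shown in the session.

```python
def build_executive_prompt(prepared_remarks, qa_pairs, question_index):
    """Build the prompt for one analyst question.

    qa_pairs is a list of (question, executive_answer) tuples in call order.
    Only answers given BEFORE question_index are included, so the model sees
    no more than the audience had heard when the question was asked.
    """
    if not 0 <= question_index < len(qa_pairs):
        raise IndexError("question_index out of range")
    prior_answers = [answer for _, answer in qa_pairs[:question_index]]
    question = qa_pairs[question_index][0]
    parts = [
        # Instruction wording is a paraphrase, not the prompt used in the research.
        "Pretend to be a top executive on an earnings call. "
        "Answer the analyst's question using only the context provided.",
        "PREPARED REMARKS:\n" + prepared_remarks,
    ]
    if prior_answers:
        parts.append("EARLIER ANSWERS:\n" + "\n".join(prior_answers))
    parts.append("ANALYST QUESTION:\n" + question)
    return "\n\n".join(parts)

# Hypothetical three-question call.
qa = [
    ("Question one?", "Answer one."),
    ("Question two?", "Answer two."),
    ("Question three?", "Answer three."),
]
prompt = build_executive_prompt("Sales and order volumes were higher.", qa, 2)
```

Excluding the executive's real answer to the current question is the point of the design: if the model can still answer well from the remarks and earlier answers alone, the topic had already been covered proactively.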

Daniel Sandberg

executive
#9

Thank you so much, Henry and Liam. That was a fantastic deep dive into how we construct these signals. To quickly recap, we built these signals using machine-readable transcripts using an off-the-shelf large language model. The precalculated signals for this research are now an integrated part of ProntoNLP, providing actionable insights to investors. ProntoNLP also includes many other signals, including those generated with a fine-tuned, purpose-built large language model designed specifically for financial applications. And what we'd like to do now is to help us understand the ProntoNLP approach and how the -- this enhancement aids in the generation of the signals. We are going to turn it over to Ronen Feldman to discuss a little more about Pronto. Ronen, over to you.

Ronen Feldman

executive
#10

Thank you, Dan. Basically, what you see here is we take an earnings call. We break it into its sections like the presentation part and obviously, the Q&A pairs. And then we break it further into sentences. And if the sentences are complex, we break it even into multiple phrases. For each phrase, we identify the sentiment that we have for that phrase, positive, negative, and neutral. Then we also identify the importance, how important is that particular phrase within the context of the whole section. So we have high, medium, and low in terms of the importance. We can use it when we calculate scores. In the signals, we can use the importance in order to provide weights. Then we also have an explanation. The explanation is extremely important because it's like a guardrail to make sure that there are no hallucinations. We look at the explanation, and we see if it really correlates to the actual text. We look at the numbers that are in the explanation to see if they appear inside the text. If we see that there are -- there is no connection between the explanation and the actual text, we actually run it again. So that minimizes the chances of any hallucinations that we may get from the LLM. One of the nice things that we get from the LLM is that events are generated automatically. There is no predefined taxonomy. So unlike a lot of other competing products where you have a predefined taxonomy and then where there is a new topic, you do not discover it until you manually change it. We use the LLM to automatically detect new topics. The problem is that there are 2.7 million topics that were identified by the LLM. That's way too many for any quantitative signal. And this is why we use the embeddings of all the instances that the LLM identified and cluster them into 110 events. When you use actually our platform, which is part of the offering, you do get the original LLM tag. So you can actually see it and you can consume it using the API. 
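[Editor's illustration] The explanation guardrail described above (check that numbers in the explanation appear in the text, and run again if they do not) can be sketched like this. It is a simplified sketch, not ProntoNLP's implementation: it checks only numeric tokens, compares them as text, and `generate_explanation` is a hypothetical stand-in for whatever LLM call produces the explanation.

```python
import re

_NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_in(text):
    """Return the set of numeric tokens in text, ignoring thousands separators."""
    without_separators = re.sub(r"(?<=\d),(?=\d)", "", text)
    return set(_NUMBER.findall(without_separators))

def is_grounded(explanation, source_text):
    """True when every number quoted in the explanation also appears in the source."""
    return numbers_in(explanation) <= numbers_in(source_text)

def tag_with_retry(generate_explanation, source_text, max_attempts=3):
    """Call a (hypothetical) LLM wrapper until its explanation passes the check.

    Returns the first grounded explanation, or None if every attempt fails.
    """
    for _ in range(max_attempts):
        explanation = generate_explanation(source_text)
        if is_grounded(explanation, source_text):
            return explanation
    return None

source = "Backlog increased by $300 million, and margins were 21.5%."
good = is_grounded("Backlog rose $300 million.", source)
bad = is_grounded("Backlog rose $350 million.", source)
```

An explanation with no numbers passes this check trivially, so a production guardrail would also need the textual correlation check mentioned above; that part is not sketched here.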
Let's go now to an example that actually utilizes the topics that we identify. So you saw previously that we identify all those events. And you can see that combining all the amazing work that Dan, Liam and Henry did, we combine it with the topics that we identify. And then you can actually find interesting peaks for specific topics that are connected to off-topicness and reactiveness, so you can see exactly which topics executives like to avoid. And let's look at the first one. The first one, we can see in 2021, we remember the supply chain issues during COVID. And that was a topic that executives who were off topic did not like to discuss. The next one you can see is the recession, so this was in 2022. You can see that there was a big jump in terms of the negative sentiment related to recession. And that, again, was the main topic for those executives that wanted to avoid answering the direct questions. And when the questions were around recession, usually, they tried to find some detour. And the last one that we can actually get here is margin, which was in 2023. And again, for all of those executives that were below the score that you can see here, the 0.8, we see that there was a big increase in such topics for those executives that try to avoid questions or find any way to go around them, things around margin. So back to you, Dan.

Daniel Sandberg

executive
#11

Thank you, Ronen. That was a great deep dive on ProntoNLP and a fantastic webinar so far, folks. We've got plenty, plenty of time for Q&A and a lot of questions coming in. So looking forward to that session. Before we jump into the questions, we've got a polling question up. To what extent have you incorporated large language models into your investment management workflows? And we'll give everyone just a few minutes -- a few seconds to fill that out there. Lots of folks in different parts of the journey and lots of considerations to be made before bringing this in. Okay. We've got another 10 seconds here to fill that out. Again, folks, there's a Q&A widget in the webinar console, and we'll be shifting. Questions coming in, in droves here. That's great. Keep them coming while you fill out that polling question. All right, let's take a look at our results. We actively use LLMs, only 8.3%; experimenting, 25%; exploring, 33%; and not currently using, 33%. So still a very early part of the journey for many folks. And I think this, hopefully, will help with getting started on that process.

Daniel Sandberg

executive
#12

Okay, so let's jump into some of the questions we've received so far. I'll give the first one over to Liam. A question came in. How do you account for sector-specific variations in executive Q&A when constructing your signals? I think you addressed that, but maybe you could just recap real quick.

Liam Hynes

executive
#13

Sure, sure. So I'm going to be on topic here. The best way to do this is probably repeat the question. So how do we control for sector variations in the portfolio construction? What we do is we obviously construct the on-topic scores and the proactive scores. And then what we do is we go into -- at the end of every month, we go into each sector and then we rank within the sector. So we'll get the top 20% and the bottom 20% in each sector, and then we combine those 11 sectors, top 20% and bottom 20%. And what that ensures is that there's an equal representation of sectors in the long portfolio and in the short portfolio.
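[Editor's illustration] The sector-neutral construction Liam describes (rank within each sector, take the top and bottom 20%, then combine across sectors) can be sketched as below. The function name, tickers and scores are hypothetical, and skipping sectors too small to fill a 20% bucket is an assumption made for this sketch.

```python
from collections import defaultdict

def sector_neutral_legs(scores, fraction=0.2):
    """Build long and short legs by ranking within each sector.

    scores is a list of (ticker, sector, score) tuples for one month-end.
    Returns (long_leg, short_leg) as lists of tickers.
    """
    by_sector = defaultdict(list)
    for ticker, sector, score in scores:
        by_sector[sector].append((score, ticker))
    long_leg, short_leg = [], []
    for members in by_sector.values():
        ranked = sorted(members, reverse=True)  # highest score first
        k = int(len(ranked) * fraction)         # names per leg in this sector
        if k == 0:
            continue  # assumed rule: skip sectors too small for a 20% bucket
        long_leg += [ticker for _, ticker in ranked[:k]]
        short_leg += [ticker for _, ticker in ranked[-k:]]
    return long_leg, short_leg

# Two hypothetical sectors with five names each.
month_end_scores = [
    ("I1", "Industrials", 0.90), ("I2", "Industrials", 0.80),
    ("I3", "Industrials", 0.70), ("I4", "Industrials", 0.60),
    ("I5", "Industrials", 0.50),
    ("E1", "Energy", 0.40), ("E2", "Energy", 0.95),
    ("E3", "Energy", 0.30), ("E4", "Energy", 0.85),
    ("E5", "Energy", 0.20),
]
long_leg, short_leg = sector_neutral_legs(month_end_scores)
```

Note that E1 lands in neither leg even though its score is below every Industrials name: ranking happens inside each sector, so the comparison is always against sector peers.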

Daniel Sandberg

executive
#14

Fantastic. So sector-neutral, equal representation in long and short. So we don't have a sector tilt in any of those. Here's another one. I really like this question. The question is why go through the process of prompting the LLM, summarizing the response, vectorizing the text and computing a cosine similarity score? Why not just ask the LLM, is this question on topic? Was it addressed in the prepared remarks?

Liam Hynes

executive
#15

Do you want me to take that one, Dan?

Daniel Sandberg

executive
#16

Sure.

Liam Hynes

executive
#17

Yes. Well, there's a few reasons why we didn't do that. One is that it's probably too subjective for the LLM. So -- and also, it would be a binary outcome for the LLM to determine whether or not the answer was on topic or not. So it would either say that the question was on topic or it was off topic. So it will be a binary assessment. And it's not a continuous quant assessment like the cosine score. So the LLM doesn't necessarily do this well. So if you go back to the Golden Ocean answer, technically, the LLM might have said that, that was on topic, right? Because the executive mentioned the FFA curve, but quantitatively, it was off topic. And we've obviously published this research. It's along with a coding notebook and there's a RAG engine that we stood up on the back end for this. And there's a couple of reasons why we went down that route. We need consistency in responses from the LLM. Technically, you could ask the LLM, is this question on or off topic? And then you could ask the LLM the same question again and it might give you a different answer. So something where you're relying on the LLM to be very subjective gives you very inconsistent results. So I could run a backtest today. And tomorrow, if I ask the LLM again, I might get varying results altogether. And so the way we did it is we ring-fenced the LLM to generate a very refined feature, and then we pipe that feature into our backtesting framework.

Daniel Sandberg

executive
#18

Yes, that's a very comprehensive answer. So yes, so quantification, to your point, you sort of quantify it as opposed to just labeling it with a binary label. And then if cosine score goes from 0 to 1 or minus 1 to 1 depending on whether the vectors are all positive or not, it's certainly -- one might think arbitrarily that off topic is less than 0.5 or something. But I think you were finding that most of the off-topic answers were 0.6, 0.7. It's really like about a 0.8 threshold that kind of divided the universe at the median. Is that right?
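
For reference, the cosine score discussed here can be sketched as below. This is an illustration under stated assumptions: the helper names `cosine_similarity` and `is_on_topic` are hypothetical, the vectors are toy values rather than real text embeddings, and the 0.8 cutoff is the approximate threshold mentioned above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_on_topic(vec_a, vec_b, threshold=0.8):
    """Label a pair of embeddings as on topic when the score reaches the threshold."""
    return cosine_similarity(vec_a, vec_b) >= threshold


# Toy vectors: the same direction, a right angle, and opposite directions.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

When every vector component is non-negative the score cannot drop below 0, which is why the range is quoted as either 0 to 1 or minus 1 to 1.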

Liam Hynes

executive
#19

Yes, that's correct. And actually, we have a slide in the appendix. I just put it up on the screen there that might kind of help answer this subjective question, right? So we did a bit of an experiment where we asked an LLM to look at the Q&A of an earnings call. And we gave it 2 prompts. The first prompt said, "You are a financial expert," and then asked to score the Q&A section from very negative to very positive on a scale of minus 2 to plus 2, so 5 increments. And then the second prompt, we just did not include a financial expert. We just said, "Score the Q&A on a very negative minus 2 to a very positive plus 2." And you can see here in the chart that the financial expert prompt came out with a mean of 1. And then when we excluded the financial expert from the prompt, it actually came out closer to minus 1. So very small changes and the LLM can be very subjective. So you don't get this consistency in responses when you ask the LLM a very kind of broad-based question.

Daniel Sandberg

executive
#20

Makes sense. Makes sense. So the vectorization and the cosine score is all very consistent and well controlled, whereas the LLM response has more variability. This actually dovetails with another question we've got, and maybe Ronen, you could jump in on this one. As a generic question, what are the biggest challenges in applying LLMs in the financial markets? And how do you address those in both this research and ProntoNLP?

Ronen Feldman

executive
#21

So I think one of the main issues is first, you need to structure the input in the right way. In this case, we did it for earnings calls. We actually do it for any kind of financial content like filings, and then you need to accommodate for a lot of things that you need to filter. So just the preprocessing, until you get really clean text that is structured in the right way, is very important. The other thing is I want to go back to the point that Dan mentioned before. We actually do heavy fine-tuning to the Llama models in order to get to a really good financial LLM. So we use over 10,000 tagged paragraphs using a bootstrapping approach, the classic teacher-student framework. And we saw that, that alone, if you do it in the right way, gave a huge boost in terms of the alpha that you get out of the signal.

Daniel Sandberg

executive
#22

Excellent. Thank you. Maybe, Liam, coming back to you. We've got a few questions here around comparing this approach to traditional sentiment analysis on earnings calls, and I'm just going to combine that with another question here. How does it work when an executive is proactive and on topic but the sentiment of the call was mostly negative? Did you control for the sentiment of the call when you were calculating your returns? And what other sort of considerations might have been made?

Liam Hynes

executive
#23

Yes. Let me just push it to the slide with the backtest results, yes. So you can see here on the third bucket there on the backtesting when we computed the portfolio returns, we looked at the 1-month forward returns. We did actually control for sentiment in the 1-month forward returns. So what we did is we stripped out the Fama-French for residuals, but then we actually looked at natural language processing signals. So sentiment being net positivity. So we looked at the Loughran-McDonald dictionary that was published, I think, in 2010. And we looked at all the positive words from that dictionary in the call, all the negative words, and you can look at a ratio of the net positivity. So we actually stripped that positive sentiment out of this signal. So it's a pure alpha signal. It doesn't have that sentiment variable embedded into it.
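
A dictionary-based net positivity measure of the kind described above might look like the following sketch. The exact ratio used in the research is not stated on the call, so the formula (positive minus negative, divided by positive plus negative) is an assumption, and the two tiny word sets are placeholders rather than the actual Loughran-McDonald lists.

```python
import re

# Placeholder word sets for illustration only; NOT the Loughran-McDonald dictionary.
POSITIVE = {"strong", "improved", "gain"}
NEGATIVE = {"loss", "decline", "weak"}


def net_positivity(text, positive=POSITIVE, negative=NEGATIVE):
    """Assumed form: (positive count - negative count) / (positive count + negative count)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Two positive hits and one negative hit.
example = net_positivity("Strong quarter, improved margins, one decline.")
```

A score built this way can then enter the return model as a control, so that whatever the on-topic and proactive signals explain is measured net of plain positive or negative wording.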

Daniel Sandberg

executive
#24

Got it, got it. So these -- and how about correlation-wise, do you see any major correlation between earnings call sentiment and these other signals?

Liam Hynes

executive
#25

No, actually. We did some Fama-MacBeth regression on it, and it actually came out very, very strong. So the on-topic score has got about 80 basis points of a coefficient with a very strong t-stat of around 3.5. And that's after we controlled for 7 or 8 fundamental factors and 3 or 4 sentiment factors. So no, it seems to be kind of a unique on topic and proactive signal that we're after finding here. It's not correlated to sentiment, which makes sense really when you think about it, right, because sentiment is looking at whether or not the executive was speaking positively or negatively, right? But theoretically, I could be answering a question on or off topic and still have a positive or negative angle on it. So I could be answering a question off topic but be very positive about how I was answering it off topic, right? So it is an accretive signal to sentiment.
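
The Fama-MacBeth procedure referred to here runs a cross-sectional regression in each period and then tests the time series of estimated coefficients. The sketch below is a deliberately simplified, single-regressor version with made-up numbers; the regression described on the call included several fundamental and sentiment controls, which are omitted, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def ols_slope(x, y):
    """Slope of a one-variable OLS regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def fama_macbeth(periods):
    """Average the per-period slopes and compute a t-stat from their dispersion.

    `periods` is a list of (scores, forward_returns) cross-sections, one per month.
    """
    slopes = [ols_slope(x, y) for x, y in periods]
    avg = mean(slopes)
    t_stat = avg / (stdev(slopes) / math.sqrt(len(slopes)))
    return avg, t_stat


# Three made-up monthly cross-sections with slopes of about 0.01, 0.02 and 0.03.
periods = [
    ([0.0, 1.0, 2.0], [0.00, 0.01, 0.02]),
    ([0.0, 1.0, 2.0], [0.00, 0.02, 0.04]),
    ([0.0, 1.0, 2.0], [0.00, 0.03, 0.06]),
]
avg_slope, t_stat = fama_macbeth(periods)
```

The average slope plays the role of the coefficient quoted above, and the t-stat comes from how stable that slope is from month to month.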

Daniel Sandberg

executive
#26

Fantastic.

Liam Hynes

executive
#27

The way I like to think about it is this is more of a behavioral signal rather than a sentiment signal, right? We're identifying behavioral characteristics of the manager on the call. And the behavior is, how did they answer the question? Are they on topic or off topic? And then what's their presentation like? Are they proactive or reactive? So I would say it's mutually exclusive to the sentiment.

Daniel Sandberg

executive
#28

Henry, this next question, I think, is for you here. Can you talk a little bit about the large language model that was used? You mentioned Llama. Which Llama model was used? And how much of the proactive on-topic signal is using the original transcript product? How much is part of ProntoNLP?

Henry Chiang

executive
#29

Absolutely. So we used the Llama-3.1-8B model. And the reason that we used this model is because at that time, we were looking for an LLM that has large enough context length. That is a big thing for us because if you think about it, an average earnings call transcript has about 10,000 tokens in its pre-prepared remarks and another 10,000 tokens in its answer. So we need to find a model that's large enough to fit all of this text in. GPT-3.5 was an option at that time, but then due to the context length constraint, we had to go with Llama. And in terms of whether this is built on top of Pronto, at that time, we -- this was purely done on machine-readable transcript. But then after we got those alpha results, we can combine this signal with Pronto, which is another value add and a strengthened result that we have.

Daniel Sandberg

executive
#30

Excellent. Thank you. Liam, coming back to you. On these signals, could an executive deliberately go off topic to avoid discussing bad news? And how often do we see executives going off topic during earnings calls?

Liam Hynes

executive
#31

Good question. So could an executive purposely go off topic to avoid answering a question? Is that what you asked?

Daniel Sandberg

executive
#32

Sure. And I think it's just a clarifying sort of question as to what you're trying to capture.

Liam Hynes

executive
#33

Right. Yes, we're essentially trying to capture evasiveness, right, or the executive not wanting to cover a particular topic. So that's what we're trying to capture when we're looking at the on topic and off topic from the executive. And it seems to be systematically when you look across the Russell 3000, it is efficacious and it is quite a strong signal. So executives when -- if they just exhibit that one characteristic of just not being able to answer the question and remain on topic, it's a very poor sign for forward results. And when you think about it, there's 2 reasons why you might have an executive who goes off topic, right? One is that they just mightn't have the competency to answer the question or they mightn't know. And the second one is that they're purposely pivoting or going off topic. Both of those reasons are bad, right? One is evasiveness, which is not good news, and the other one is the executive potentially just doesn't understand the question or the business model or the operations associated with it. I'm sorry, Dan, what was the second question you asked?

Daniel Sandberg

executive
#34

How often do executives go off topic? I think you had a slide towards the end there on the number of off-topic and reactive questions. And I think that was an absolute sort of number. Do you have a sense of what...

Liam Hynes

executive
#35

Yes. So actually, this is quite interesting actually. So if we look at a cosine score threshold of about 0.8 both on the proactive score and the off-topic score, you can see here that it kind of hovers around, let's say, 28% or 30% of executives who are reactive and off topic. So it's quite a sizable chunk, maybe anywhere between 1/4 to just under 1/3. And actually, if you look at this chart, you can actually see that there's some spikes here in the fourth quarter, right? And the reason we think that there might be some spikes in the fourth quarter is that the fourth quarter earnings call is normally going to be centered around the full year results. So when executives are delivering their pre-prepared remarks, it's mainly going to be centered around financials, and there might not be a lot of room for other topics in there. Hence, you get this spike in reactiveness in the fourth quarter because they're discussing full year results. And also, you see that in them being off topic as well in the fourth quarter, and that's potentially because they might be focusing on the financial results or they might be prioritizing other components in the earnings call. So they have less room for spontaneous discussion.
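
The quarterly share described above can be tallied along these lines. This is a sketch with made-up scores and a hypothetical function name; it counts one score series against the roughly 0.8 threshold, and whether the published chart applies the cut to each score separately or to both at once is not clear from the call.

```python
from collections import defaultdict


def share_below_threshold(records, threshold=0.8):
    """Return {quarter: fraction of scores below `threshold`}.

    `records` is an iterable of (quarter_label, cosine_score) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # quarter -> [number below, total]
    for quarter, score in records:
        counts[quarter][1] += 1
        if score < threshold:
            counts[quarter][0] += 1
    return {quarter: below / total for quarter, (below, total) in counts.items()}


# Made-up scores: one of four below 0.8 in Q3, two of four in Q4.
records = [("Q3", 0.90), ("Q3", 0.85), ("Q3", 0.70), ("Q3", 0.95),
           ("Q4", 0.60), ("Q4", 0.75), ("Q4", 0.90), ("Q4", 0.82)]
shares = share_below_threshold(records)
```

Plotting these shares over time is one way to look for the fourth-quarter spike mentioned in the answer.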

Daniel Sandberg

executive
#36

Fantastic. Maybe one more question. We've got about 5 minutes left, and all 3 of you maybe can weigh in. How should those that are interested in adopting large language modeling in their process and generative AI in general think about the approach? How should you get started?

Liam Hynes

executive
#37

Well, there's one way you can get started. We actually published -- let me jump on to the -- so if you scan this QR code, you'll be able to get access to the paper that we wrote, Questioning the Answers. But in this paper, we also had a coding notebook, and that is a very comprehensive, thousands of lines of code in there, in order to stand our readers up on machine-readable transcripts, generating the data frame, getting those Q&A pairs, vector embedding the text, doing the cosine similarity scores, the whole shebang. Everything that we've covered in the presentation today and on the research is -- you can see that in the coding notebook. If you click a link on the PDF, it will bring you to a noncompute coding notebook where you can view all of our code, so fully transparent. That's one way that you could successfully stand up some LLM integration onto your textual data suite.

Daniel Sandberg

executive
#38

Okay. This has been a fantastic discussion, and I want to thank all of our speakers, Liam, Henry, Ronen, for sharing your expertise today. For those who want to revisit today's session, a webinar replay will be available. You can also check out the Related Content widget, which should have links to the white paper, the coding notebook, additional resources. On behalf of S&P Global Market Intelligence, thank you very much for joining us. We look forward to continuing this conversation. Have a great day ahead, folks.

Liam Hynes

executive
#39

Thanks, Dan. Thanks, folks.

Ronen Feldman

executive
#40

Yes. Thanks, Dan. That was wonderful.
