Abstract

We investigate whether an LLM can successfully perform financial statement analysis in a way similar to a professional human analyst. We provide standardized and anonymous financial statements to GPT4 and instruct the model to analyze them to determine the direction of future earnings. Even without any narrative or industry-specific information, the LLM outperforms financial analysts in its ability to predict earnings changes. The LLM exhibits a relative advantage over human analysts in situations when the analysts tend to struggle. Furthermore, we find that the prediction accuracy of the LLM is on par with the performance of a narrowly trained state-of-the-art ML model. LLM prediction does not stem from its training memory. Instead, we find that the LLM generates useful narrative insights about a company’s future performance. Lastly, our trading strategies based on GPT’s predictions yield a higher Sharpe ratio and alphas than strategies based on other models. Taken together, our results suggest that LLMs may take a central role in decision-making.

Keywords: Financial statement analysis, Large language models, GPT4, chain-of-thought, neural network, asset pricing, earnings, direction of earnings changes, analysts

JEL Codes: G12, G14, G41, M41

⁴

1 Introduction↩︎

Can large language models (LLMs) make informed financial decisions or are they simply a support tool? Their advanced capabilities to analyze, interpret, and generate text enable LLMs to excel across a wide range of tasks, including summarization of complex disclosures, sentiment analysis, information extraction, report generation, compliance verification, etc. [1]–[6]. All these tasks, however, involve the textual domain and require specialized training or fine-tuning of the model.⁵ The boundaries of this disruptive technology outside of the textual domain and with respect to more general tasks that require numeric analysis and judgment are yet to be understood. We probe these boundaries in the financial analysis domain.

We study whether an LLM can successfully perform financial statement analysis in a way similar to what professional human analysts do. The answer to this question has far-reaching implications for the future of financial analysis and whether financial analysts will continue to be the backbone of informed decision-making in financial markets. The answer is far from obvious, given that an LLM lacks the deep understanding of the financials of a company that a human expert would have. Further, one of the most challenging domains for a language model is the numerical domain, where the model needs to carry out computations, perform human-like interpretations, and make complex judgments [7]. While LLMs are effective at textual tasks, their understanding of numbers typically comes from the narrative context and they lack deep numerical reasoning or the flexibility of a human mind.

Financial statement analysis (FSA), sometimes referred to as fundamental analysis, is a particularly useful setting to examine the role of LLMs in future decision-making. Traditionally, financial statement analysis is performed by financial analysts and investment professionals with the primary objective to understand the financial health of a company and determine whether its performance is sustainable. Unlike a typical task performed by an LLM, FSA is a quantitative task that involves analyzing trends and ratios. At the same time, it also requires critical thinking, reasoning, and ultimately, complex judgments. Importantly, unlike in other applications, such as answering bar or CPA exam questions [8], [9], an LLM cannot rely on its memory for the correct answer.

Our research design involves passing a balance sheet and income statement in a standardized form to the large language model, GPT 4.0 Turbo, and asking the model to analyze them. In particular, based on the analysis of the two financial statements, the model must decide whether a firm’s economic performance is sustainable and, more specifically, whether a company’s earnings will grow or decline in the following period. We focus on earnings because they are the primary variable forecasted by financial analysts and fundamental for valuation [10]–[12].

A key research design choice that we make is to not provide any textual information (e.g., Management Discussion and Analysis) that typically accompanies financial statements. While textual information is easy to integrate, our primary interest lies in understanding the LLMs’ ability to analyze and synthesize purely financial numbers. We use this setup to examine several research questions.

First, can a large language model generate economic insights purely from the numbers reported in financial statements absent any narrative context? How does an LLM’s performance compare to that of human analysts and do they add incremental value? Can the model’s performance be enhanced via instructions that emulate steps typically followed by financial analysts? How does LLM’s performance compare to other benchmarks, such as logistic regression and a state-of-the-art ANN design, and whether the model can offer additional insights?

Conceptually, an LLM can add value relative to a human analyst due to its ability to quickly analyze large quantities of unstructured data and a vast knowledge base that enables to model to recognize patterns, e.g., familiar business situations, in the data. It is not obvious, however, that these considerations are particularly relevant for the present task. In fact, there are a number of reasons to expect that professional analysts will outperform a machine-based approach to financial statement analysis. First, financial statement analysis is a complex and loosely defined task that involves ambiguity and requires common sense, intuition, and flexibility of the human mind. Second, it requires reasoning and judgment that machines presently lack. Finally, it necessitates a broader understanding of the industry and macro-economy.

When compared to a narrowly specialized ML application, such as an artificial neural net (ANN) trained for earnings prediction, an LLM also appears to be at a serious disadvantage. Training a specialized ANN allows the model to learn deep interactions that contain important cues that cannot be easily gathered by the general-purpose model without providing additional insights or context. Nevertheless, an LLM’s advantage potentially lies in its vast knowledge and general understanding of the world, such as business concepts and investment theories that allow it to emulate deductive reasoning performed by humans. This could include intuitive reasoning and forming hypotheses based on incomplete information or previously unseen scenarios.

Our approach to testing an LLM’s performance involves two steps. First, we anonymize and standardize corporate financial statements to prevent the potential memory of the company by the language model. In particular, we omit company names from the balance sheet and income statement and replace years with labels, such as \(t\), and \(t-1\). Further, we standardize the format of the balance sheet and income statement in a way that follows Compustat’s balancing model. This approach ensures that the format of financial statements is identical across all firm-years so that the model does not know what company or even time period its analysis corresponds to.

In the second stage, we design prompts that instruct the model to perform financial statement analysis and, subsequently, to determine the direction of future earnings.⁶ In addition to a simple prompt, we develop a Chain-of-Thought (CoT) prompt that effectively “teaches" the model to mimic a financial analyst.⁷ In particular, as a part of their analysis, financial analysts identify notable trends in financial statement line items, compute key financial ratios (e.g., operating efficiency, liquidity, and (or) leverage ratio), synthesize this information, and form expectations about future earnings [17]. Our CoT prompt implements this thought process via a set of instructions ultimately making a determination of whether next year’s earnings will increase or decrease compared to the current year.

We test the model’s performance using the Compustat universe and, when necessary, intersect it with the IBES universe. The full sample spans the 1968-2021 period and includes 150,678 firm-year observations from 15,401 distinct firms. The analyst sample spans the 1983-2021 period with 39,533 observations from 3,152 distinct firms. Our target variable across all models is a directional change in future earnings. To evaluate analysts’ prediction accuracy, we compute consensus forecasts (the median of individual analyst forecasts issued in the month following the release of financial statements) and use them as an expectation for the following year’s earnings. This ensures the comparability of analysts’ forecasts and model prediction results.⁸ In addition, we also use three-month and six-month ahead consensus forecasts as alternative expectation benchmarks. These benchmarks disadvantage the LLM as they incorporate the information acquired during the year. However, because analysts may be sluggish to incorporate new information into their forecasts, we report them for comparison purposes.

We start by analyzing GPT’s performance compared to security analysts in predicting the direction of future earnings [13]. At the outset, we note that predicting changes in EPS is a highly complex task as the EPS time series are approximated by a random walk and contain a large unpredictable component. We find that the first-month analysts’ forecasts achieve an accuracy of 53% in predicting the direction of future earnings, which dominates the 49% accuracy of a naive model that extrapolates the prior year’s change.⁹ Three- and six-month ahead forecasts achieve a meaningfully higher accuracy of 56% and 57% respectively, which is intuitive given that they incorporate more timely information.

A “simple" non-CoT prompt GPT-based forecasts achieve a performance of 52%, which is lower compared to the analyst benchmarks, which is in line with our prior. However, when we use the chain of thought prompt to emulate human reasoning, we find that GPT achieves an accuracy of 60%, which is remarkably higher than that achieved by the analysts. Similar conclusions follow if we examine the F1-score, which is an alternative metric to evaluate a model’s forecasting ability (based on a combination of its precision and recall). This implies that GPT comfortably dominates the performance of a median financial analyst in analyzing financial statements to determine the direction a company is moving in.

We probe deeper to understand the strengths and weaknesses of humans relative to an LLM. Intuitively, human analysts may rely on soft information or a broader context not available to the model and thus add value [19], [20]. Indeed, we find that analysts’ forecasts contain useful insights about future performance not captured by GPT. Furthermore, we show that when humans struggle to come up with the future forecast, GPT’s insights are more valuable. Similarly, in the instances where human forecasts are prone to biases or inefficiency (i.e., not incorporating information rationally), GPT’s forecasts are more useful in predicting the direction of future earnings.

As human forecasts are known to exhibit statistical biases [21], [22], it is also interesting to examine GPT’s performance relative to specialized ML applications trained specifically to predict earnings based on a large dataset. We examine three such forecasting models. The first model follows [13] and relies on a stepwise logistic regression model with 59 predictors.¹⁰ Our second model is an artificial neural network (ANN) that uses the same 59 predictors but also leverages non-linearities and interactions among them. Third, to ensure consistency between GPT and ANN, we also use the ANN model trained on the same information set (the income statement and balance sheet) that we provide to GPT. Importantly, we train these models each year based on five years of historical data using a population of observations on Compustat. All forecasts are out of sample.¹¹

Using the entire Compustat sample, we find that the stepwise logistic regression achieves an accuracy (F1-score) of 52.94% (57.23%), which is on par with human analysts and consistent with the prior literature [13], [23]. In contrast, ANN trained on the same data achieves a much higher accuracy of 60.45% (F1-score 61.62), which is in the range of the state-of-the-art earnings prediction models. When we use GPT CoT forecasts, we observe that the model achieves an accuracy of 60.31% on the entire sample, which is very similar to the ANN’s accuracy. In fact, GPT exhibits a meaningfully higher F1 score compared to the ANN (63.45% vs. 61.6%). When we train the ANN exclusively using the data from the two financial statements (fed into GPT), which is a smaller information set, we find that ANN’s predictive ability is slightly lower, with an accuracy (F1-score) of 59.02% (60.66%), compared to GPT’s performance. Overall, these findings suggest that GPT’s accuracy is on par (or even slightly higher) than the accuracy of narrowly specialized state-of-the-art machine learning applications. This is a somewhat surprising result because specialized models are trained to leverage information most efficiently. It indicates a remarkable aptitude a pre-trained large language model possesses to analyze financial statements and even more so given that we do not provide any textual disclosures, such as MD&A.

We further observe that ANN’s and GPT’s predictions are complementary in that both of them contain useful incremental information with some indication that GPT tends to do well when ANN struggles. In particular, ANN predicts earnings based on the training examples it saw in the past data, and given that many of the examples are complex and highly multidimensional, its learning capacity may be limited. In contrast, GPT makes relatively fewer mistakes when predicting the earnings of small or loss-making companies, likely benefiting from its human-like reasoning and extensive knowledge. This ability to draw upon a broader range of knowledge provides a distinct advantage for the language model.

We perform several additional experiments partitioning the samples based on GPT’s confidence in its answers, and using different families of LLMs. When GPT answers with higher confidence, the forecasts tend to be more accurate than less confident forecasts. We also find that the earlier version, GPT3.5, shows considerably less impressive performance, suggesting that our main results should not be taken for granted. At the same time, we show that the results generalize to other LLMs. In particular, Gemini Pro, recently released by Google, achieves a similar level of accuracy compared to GPT 4.

Given the documented consistently impressive LLM’s performance in fundamental analysis, it is interesting to understand why the model is so successful. We examine two broad hypotheses. The first hypothesis is that GPT’s performance is driven by its (possibly near-perfect) memory. It would be especially problematic if GPT could somehow infer the company’s identity and year from the data and match this information with the sentiment about this company learned from newspaper articles or press releases. We aim to rule out this hypothesis (see Section 6.1). Furthermore, we replicate our results using the most recent year of data, which lies outside GPT4’s training period (i.e., pure out-of-sample tests).

Our second hypothesis is that GPT generates useful insights based on which the model infers the direction of future earnings. For example, we observe that the model frequently computes standard ratios computed by financial analysts and, as instructed by CoT prompt, generates narratives that analyze these ratios. To test this, we pool all narratives generated by the model for a given firm-year and encode them into 768-dimensional vectors (embeddings) using BERT. We then feed these vectors into an ANN and train it to predict the direction of future earnings. We find that the ANN trained on the GPT’s narrative insights achieves an accuracy of 59%, which is almost as high as the GPT forecast accuracy (60%). In fact, the embedding-based ANN achieves an F1-score that is higher than GPT’s (65% vs. 63%). This result presents direct evidence that the narrative insights generated by the model are informative about future performance. Further, we observe a 94% correlation between GPT’s forecasts and ANN forecasts based on the GPT’s narratives, suggesting that the information encoded by these narratives is the basis for GPT’s forecasts. We also find that narratives related to ratio analysis, in particular, are most important in explaining the direction of future earnings. In sum, the narratives derived from CoT reasoning are responsible for the model’s superior performance.

Finally, we explore the economic usefulness of GPT’s forecasts by analyzing their value in predicting stock price movements. We find that the long-short strategy based on GPT forecasts outperforms the market and generates significant alphas and Sharpe ratios. For example, alpha in the Fama-French three-factor model exceeds 12% per year. GPT stands out for doing particularly well in predicting the returns for small companies, as compared to ANN-based strategies.¹²

We make several contributions to the literature. First, to the best of our knowledge, we are the first to provide large-scale evidence on LLM’s ability to analyze financial statements – a complex task that is traditionally performed by human analysts. We show that an LLM can generate state-of-the-art inferences about the direction of the company, outperforming financial analysts and prior models and generating valuable insights along the way. Importantly, we show that the language model can successfully analyze numbers in financial statements without any narrative context.

Second, our results provide evidence on the limits of LLMs. In particular, the boundaries of generative AI to successfully perform tasks outside of their native domain are not well understood. We find that an LLM excels in a quantitative task that requires intuition and human-like reasoning. The ability to perform tasks across domains points towards the emergence of Artificial General Intelligence. Broadly, our analysis suggests that LLMs can take a more central place in decision-making than what is previously thought.

Third, we contribute to the literature on fundamental analysis. Starting from [13], there is a large literature in accounting that focuses on earnings prediction based on accounting fundamentals [14], [23], [24]. In particular, [14] predict the direction of earnings changes using tree-based machine learning models trained on over 12,000 exploratory variables based on firms’ XBRL tags.¹³ We use a novel approach to analyze financial information to derive insights about future performance. In particular, we show that an LLM-based financial statement analysis, by drawing on vast knowledge and chain-of-thought reasoning, complements humans as well as specialized models in generating value-relevant information. In that sense, we also contribute to the recent literature on the relative advantage of humans versus AI in financial markets [19], [20], [25].

2 Conceptual Underpinnings↩︎

Financial statement analysis, or fundamental analysis, has long been considered of critical importance for informed decision-making [26]. It uses the numbers reported in financial statements to gain insights into the financial health of the company, aiming to reveal information about a firm’s future prospects and valuation [13], [27], [28].

Financial statement analysis underlies the work performed by financial analysts, who play a pivotal role in financial markets.¹⁴ One of their primary tasks involves predicting firms’ earnings, which serves both as an input in their own stock market recommendations and an output that informs investors [29], [30]. When making earnings forecasts, their work typically begins with a systematic analysis of financial statements [17], often using standardized templates to ensure consistency and accuracy. This analysis enables financial analysts to establish a baseline understanding of a company’s financial position and performance, assessing factors such as operating performance or capital structure. They then contextualize this financial data by drawing upon their industry and private knowledge about the firm before issuing their forecasts [30]. The accuracy and quality of these forecasts not only drive market perceptions but also are fundamental to analysts’ career advancement and job security [22], [31].

Prior research generally concludes that sell-side analysts outperform time series models in terms of producing credible annual earnings forecasts [32]. Consequently, these forecasts are frequently used as a proxy for markets’ earnings expectations. At the same time, prior research has shown that financial analysts produce potentially erroneous or biased estimates [32], [33]. For example, [34] show that analysts make technical errors and questionable economic judgments when evaluating firms with quantitative methods. Evidence from [35] or [36] suggest that financial analysts overreact to recent events. These mistakes and biases highlight the complexity of processing information efficiently when large volumes of data are involved.

Recognizing these challenges in conventional financial forecasting and human information processing, general-purpose language models, such as ChatGPT, hold promise in facilitating financial statement analysis and the associated tasks such as earnings forecasting and decision-making more generally. These advanced AI systems are noted for their expansive knowledge across various domains and ability to quickly and efficiently process large quantities of data [37]. For example, their proficiency extends to answering CFA or CPA exam questions [9], demonstrating their financial knowledge and potential for understanding theories. In a similar vein, prior literature has shown that these models are capable of efficiently processing large sets of financial data [4], [5]. LLMs have also shown promise in predicting certain economic outcomes. [6] and [38] show that GPT can explain short-term stock returns based on newspaper headlines and [2] finds that GPT’s macroeconomic prediction aligns well with the expert survey results. In addition, [39] document that GPT can understand the political stance of FOMC announcements and relate it to future macroeconomic shocks.

However, despite the successes of large language models in many tasks, they are primarily viewed as a support tool and their ability to act autonomously to perform financial statement analysis at a level of a human analyst faces significant challenges. First, financial statement analysis is a broad task that is more of an art than science, whereas machines typically excel in narrow, well-defined tasks. It requires common sense, intuition, ability to reason and make judgements, ability to handle situations unseen previously. Second, LLM is not trained to analyze financial information, e.g., in the same way they are trained to summarize text or answer questions. In fact, inputs into the tasks performed by LLMs have been predominantly qualitative and language-based, and, LLMs have struggled with understanding numeric domain [7]. Third, humans are more capable of incorporating their knowledge of broader context – something a machine often cannot do – by taking into account soft information, knowledge of the industry, regulatory, political, and macroeconomic factors. These factors stack up against the odds that an LLM can achieve a human like performance in analyzing financial statements.¹⁵

An alternative to utilizing a general-purpose large language model for financial statement analysis involves specifying a more narrow objective, such as earnings prediction, and training a specialized ML model, such as Artificial Neural Network (ANN), to perform this task. Unlike the general-purpose large language models, which are trained to predict the next word in a textual sequence, ANNs learn deep interactions among a large number of predictors to deliver powerful forecasts of the target variable.¹⁶ Because LLMs are not trained to uncover these complex relationships among predictors, they are fundamentally disadvantaged relative to the specialized models in a specific prediction task. Nevertheless, the effectiveness of these ANNs can be limited if they encounter patterns not observed during training with sufficient frequency. This is where theoretical knowledge or general understanding of how the world works becomes essential, as does the value of human experience, intuition, and judgment. This grants possibly an important advantage to an LLM due to its training on a vast body of general knowledge that encompasses a multitude of business cases and situations, financial theories, and economic contexts. This broader theoretical foundation potentially allows LLMs to infer insights even from unfamiliar data patterns, providing an advantage in the complex domain of financial analysis.

3 Methodology and Data↩︎

In this section, we outline how we approach the primary task of using an LLM to analyze and predict earnings changes. Earnings prediction is a complex task that combines qualitative and quantitative analyses and involves professional judgment. We model how analysts make earnings predictions with a chain-of-thought prompt using GPT 4.

3.1 Financial Statement Analysis and Earnings Prediction↩︎

3.1.0.1 Overview

Earnings prediction derived from financial statement analysis is of considerable importance to accounting information users. For example, such predictions help investors to make inferences about the cross-section of expected stock returns [42] or to pick the best-performing stocks [27]. However, earnings are hard to predict as they are influenced by many exogenous factors such as macroeconomic shocks [43], product market demand shocks, changes in accounting standards [44], and many other factors. Therefore, predicting earnings is challenging even for state-of-the-art ML models [14], [24].

Financial analysts approach this complex task by performing financial statement analysis. They first analyze financial statements, identifying notable changes or trends in accounting information. They choose which financial ratios to compute to obtain further insights. Their analysis is enriched by contextual information, such as industry information, understanding of the competitive landscape, and macroeconomic conditions [17]. Based on this information, they apply professional judgments to determine whether a company’s earnings will grow or contract in the future.

In this study, we specifically focus on a relatively narrow information set that includes numerical information reported on the face of two primary financial statements. While this lacks textual information or broader context and thus puts an LLM at a disadvantage relative to a human, it presents a well-defined information set of exclusively numeric data. This approach allows us to test the limits of the model when analyzing financials and deriving insights from the numeric data – something that an LLM is not designed nor trained to do.

To approach FSA-based earnings prediction based on a Large Language Model, we implement two types of prompts. First, we use a “simple" prompt that instructs an LLM to analyze the two financial statements of a company and determine the direction of future earnings. This prompt does not provide further guidance on how to approach the prediction task, however.¹⁷ Second, we implement a Chain-of-Thought prompt that breaks down the problem into steps that parallel those followed by human analysts. This prompt effectively ingrains the methodology into the model, guiding it to mimic human-like reasoning in its analysis. We mostly focus on the results from this second prompt in our analysis.

3.1.0.2 Human Processing and Chain-of-Thought

Modern large language models can retrieve numbers from structured tables and perform simple calculations. However, they lack the ability to reason like a human and perform judgment. Recent research suggests that chain-of-thought prompting can significantly enhance the reasoning and problem-solving abilities of large language models [16].

We implement the CoT prompt as follows. We instruct the model to take on the role of a financial analyst whose task is to perform financial statement analysis. The model is then instructed to (i) identify notable changes in certain financial statement items, and (ii) compute key financial ratios without explicitly limiting the set of ratios that need to be computed. When calculating the ratios, we prompt the model to state the formulae first, and then perform simple computations. The model is also instructed to (iii) provide economic interpretations of the computed ratios. Then, using the basic quantitative information and the insights that follow from it, the model is instructed to predict whether earnings are likely to increase or decrease in the subsequent period. Along with the direction, we instruct the model to produce a paragraph that elaborates its rationale. Overall, this set of instructions aims to replicate how human analysts analyze financial statements to determine whether a firm’s performance is sustainable [17].

In addition to the binary prediction accompanied by a rationale statement, we also prompt the model to provide the predicted magnitude of earnings change and the confidence in its answer [2], [3]. The magnitudes contain three categories: large, moderate, and small. The confidence score measures how certain the model is in producing its answers and ranges from zero (random guess) to one (perfectly informed).

We use gpt-4-0125-preview, which is the most updated GPT model by OpenAI at the time of our experiment. The temperature parameter is set to zero to ensure minimal variability in the model’s responses. We do not specify the amount of max_tokens, and top-p sampling parameter is set to one (i.e., the most likely word is sampled by the model with probability one). In addition, we enable the logprobs option to obtain token-level logistic probability values. Figure 4 provides a visual illustration of GPT’s processing steps.

3.2 Data↩︎

We use the entire universe of Compustat annual financial data from the 1968 to 2021 fiscal years. We also set aside data for 2022 to predict 2023 fiscal year earnings to test for the robustness of the model’s performance outside GPT’s training window. In particular, the GPT-4-Turbo preview’s training window ends in April 2023, and the model cannot have seen the earnings data of 2023, which was released in late March 2024. Following prior literature, we require that each observation has non-missing total assets, year-end assets value exceeding one million dollars, a year-end stock price exceeding one dollar per share, and a fiscal period end date of December 31.¹⁸ We also drop observations where the balance sheet equation does not hold. These filters leave us with 150,678 observations from 15,401 distinct firms, reasonably approximating the Compustat universe.

For each firm-year, we reconstruct the balance sheet and income statement using the data from Compustat. The format follows Capital IQ’s balancing model and is the same across all firm years. We omit any identifying information, such as the firm name or dates of the financial statements. This step ensures that all firm-year observations have an identical financial statement structure. Consistent with US GAAP reporting requirements, we provide two years of balance sheet and three years of income statement data. An example of the two statements is provided in Appendix B.¹⁹

For the analysis that involves analyst forecasts, we use data from IBES, starting the sample in 1983. We extract individual forecasts and construct monthly consensus forecasts. This analysis restricts the sample to firm-years with analyst following. We require that each observation has at least three analyst forecasts issued, which leaves us with 39,533 firm-year observations.

We report descriptive statistics for the variables used in our analyses in Table 2. Panel A describes the full sample (1968-2021), and Panel B is restricted to the analyst sample (1983-2021). The data in Panel A reveals that approximately 55.5% of observations report an actual increase in earnings (Target). Predicted values include the prefix “Pred_" and vary depending on the model. For example, GPT prediction (\(Pred\_GPT\)) implies that, on average, 53.0% of observations will experience an increase in earnings. In Panel B, \(Pred\_Analyst1m\) denotes the forecasts issued within one month from the previous year’s earnings release. Analyst forecasts indexed by 3m and 6m suffixes are defined in an analogous manner. Compared to GPT, financial analysts tend to be slightly more pessimistic in their forecasts (fluctuating around 52% depending on the timing of the forecasts). Panel B also reveals that companies in the Analyst Sample are, on average, larger in size (Size), have a lower book-to-market ratio (BtoM), higher leverage (Leverage), and lower earnings volatility (Earn_Vol). However, they are similar in terms of the actual frequency of EPS increases.

4 How Does an LLM Perform Compared to Financial Analysts?↩︎

In this section, we evaluate the performance of a large language model in the analysis of financial statements aimed at predicting the direction of future earnings by using human analysts as a benchmark. All prediction models have a binary target variable, which indicates an increase or decrease in EPS in the subsequent year.

4.1 Prediction Methods and Evaluation Metrics↩︎

4.1.0.1 Naive Model

First, as a naive benchmark, we assume that the directional change in earnings will stay the same. In particular, if EPS has increased (decreased) in year \(t\) relative to year \(t-1\), the naive prediction for year \(t+1\) is also “increase" (”decrease").

4.1.0.2 Analysts’ Forecasts

We use a consensus analyst forecasts of year \(t+1\) EPS published following the announcement of year \(t\) earnings. If there are multiple forecasts issued by a single analyst, we use the closest one to the year \(t\) earnings release dates. This approach helps us to ensure that human analysts are making predictions of one-year-ahead earnings based on financial statements published in the current year. Then we take the median value of analysts’ forecasts and compare it to the actual year \(t\) EPS. We require at least three analyst forecasts in a given firm-year to compute median values. If the median forecasted EPS value is larger than the year \(t\) EPS, we label the prediction as “increase" and vice versa. Analyst forecast accuracy is then obtained in an analogous manner.

As a comparison, we also collect analyst forecasts issued at least three and six months after the release of year \(t\) financial statements. This ensures that the analysts have enough time to process the reported financials. However, this also means that the analysts will have access to one or two quarterly financial statements and other contextual information generated during the year \(t+1\). Therefore, human analysts generally have an informational advantage relative to the models that rely on time \(t\) information only.

4.1.0.3 Evaluation Metrics

We report two common metrics to evaluate the quality of the prediction method: accuracy and F1-score. Accuracy is the percentage of correctly predicted cases scaled by the total number of predictions made. F1-score is the harmonic mean of precision and recall. Precision measures the proportion of true positive predictions in the total positive predictions, while recall measures the proportion of true positive predictions out of all actual positives. In particular, F1-score is defined as follows: \[\begin{align} F1 = \frac{2\times TP}{2\times TP + FP +FN} \end{align}\] where \(TP\) is the number of true positive predictions, \(FP\) is the number of false positive predictions, and \(FN\) is the number of false negative predictions.

4.2 Main Results↩︎

Table [tab:tab2] compares GPT’s prediction accuracy with that achieved by financial analysts. Based on the first-month forecast following the release of prior year financial statements, analysts’ accuracy is 52.71% and F1 score is 54.48% when predicting the direction of one-year-ahead earnings. As expected, this is better than predictions based on a naive model (accuracy = 49.11% and F1 score = 53.02%). However, these results also reiterate the notion that changes in earnings are very hard to predict, even for sophisticated financial analysts. As expected, the analysts’ prediction accuracy improves through the course of the year \(t+1\), achieving an accuracy of 55.95% and 56.58% for month-three and month-six forecasts, respectively.

Turning to GPT’s predictions, we observe the following: Using a simple prompt instructing GPT to analyze financial statements and predict the direction of future earnings yields an accuracy of 52.33% and an F1-score of 54.52%. Thus, without CoT reasoning, the model’s performance is on par with the first-month consensus forecasts by financial analysts, following the earnings release. However, the performance markedly improves when we utilize CoT-based GPT forecasts. With chain-of-thought prompts, GPT achieves an accuracy of 60.35%, or a 7 percentage points increase compared to analyst predictions one month after the earnings release. The difference is statistically significant at 1% level.²⁰ This edge is particularly noteworthy since we do not provide to the language model any available to the analysts narrative or contextual information beyond the balance sheet and income statement.

Taken together, our results suggest that GPT can outperform human analysts by performing financial statement analysis even without any specific narrative contexts. Our results also highlight the importance of a human-like step-by-step analysis that allows the model to follow the steps typically performed by human analysts. In contrast, simply instructing the model to analyze complex financial statements does not yield strong prediction results.

4.3 Complementarity Between Human Analysts and GPT↩︎

Given that GPT outperforms human analysts in predicting future earnings, this finding raises the question of whether an LLM can largely replace human analysts. In our context, humans are expected to rely on a broader information set and hence should have an advantage over an LLM that does not have access to qualitative information, for example. More generally, humans often rely on soft information not easily accessible to a machine [19], [20], which puts humans at an informational advantage. We next explore the presence of complementarities and trade-offs related to LLM vs. human forecasts.

4.3.0.1 Sources of Incorrect Answers

We start with the analysis of instances where forecasts are erroneous. We estimate a simple linear regression to examine whether firm characteristics have systematic associations with prediction accuracy. \(I(\text{incorrect} = 1)\) is an indicator variable that equals one when the earnings prediction does not match the actual change in earnings. We then estimate the following OLS regression: \[\begin{align} \label{eq:det} I(\text{incorrect} = 1)_{it} = \beta \mathbf{X}_{it} + \delta_{year} + \delta_{ind} + \varepsilon_{it} \end{align}\tag{1}\] \(\mathbf{X}_{it}\) is a vector of firm \(i\)’s year \(t\) characteristics: asset size, leverage, book-to-market ratio, earnings volatility, loss indicator, and property, plant, and equipment scaled by total assets. \(\delta_{year}\) and \(\delta_{ind}\) denote year and industry (SIC two-digit) fixed effects, respectively. All continuous variables are winsorized at the 1% level and standard errors are clustered at the SIC two-digit industry level.

We present the results in Table [tab:tab3], Panel A, and Figure 5. In column (1), we document that GPT’s predictions are more likely to be inaccurate when the firm is smaller in size, has a higher leverage ratio, records a loss, and exhibits volatile earnings. These results are intuitive and, notably, prior studies find these characteristics to be economically associated with earnings quality.²¹ For comparison, in columns (2), (3), and (4), we report the determinants of analysts’ inaccurate predictions. Several interesting differences emerge compared to column (1). First, even though analysts face difficulties in predicting small firms’ earnings, the magnitude of these coefficients is nearly half compared to the coefficient in column (1) (p-value is less than 1% for all three comparisons). Considering that analysts have access to narrative information and broader context, this result is consistent with [53], who show that context matters more for prediction tasks when the firm is smaller in size. Another notable difference is that analysts are less likely to make errors relative to GPT when a firm reports a loss and exhibits volatile earnings. These findings are the same for all analyst forecast measures as the magnitudes of the coefficients on Loss and Earnings Volatility in columns (2), (3), and (4) are consistently smaller than that of column (1). Taken together, our results show that analysts and GPT both have difficulties in predicting the earnings of small, loss-reporting firms. However, analysts tend to be relatively better at dealing with these complex financial circumstances than GPT, possibly due to other soft information and additional context [19].

4.3.0.2 Incremental Informativeness

We next test whether analysts’ forecasts, despite lower accuracy, add useful insights incremental to GPT’s predictions. We regress an indicator \(I(Increase = 1)\), which equals one when subsequent period earnings increase and zero otherwise, on the direction of future earnings predicted by GPT and/or analysts. Specifically, we estimate the following OLS regression: \[\begin{align} \label{eq:inc} I(Increase = 1)_{it} = \beta_1 Pred\_GPT_{it} + \beta_2 Pred\_Analyst_{it} + \delta_{year} + \delta_{ind} + \varepsilon_{it} \end{align}\tag{2}\] where \(Pred\_X\) is an indicator that equals one when "\(X\)" (which is either “GPT" or”Analyst") predicts an increase in earnings, and zero otherwise. \(\delta_{year}\) and \(\delta_{ind}\) are year and industry (SIC two-digit level) fixed effects. Standard errors are clustered at the industry level.

The results are presented in Table [tab:tab3], Panel B. In column (1), we find that GPT’s prediction, on a standalone basis, is positively associated with future outcomes while controlling for industry and year-fixed effects. The same result holds for individual analysts’ forecasts as can be seen in columns (2), (3), and (4). Consistent with the results in Table [tab:tab2], analysts’ forecasts issued six months after the earnings release exhibit stronger associations with the actual outcomes than the forecasts issued one month after the earnings release (the adjusted R-squared in column (4) is 0.044, which is almost twice the adjusted R-squared value in column (2)).

In columns (5), (6), and (7), we include both GPT and analyst forecasts simultaneously in a single regression. Across all models, both coefficients are statistically significant. We observe that the coefficient on GPT is largely unchanged (its t-statistics marginally decreases from 2.99 to 2.67) and the coefficient on analysts’ predictions increases in magnitude when both variables are used simultaneously (e.g., from 0.073 in column (2) to 0.110 in column (5)). The adjusted R-squared value also increases from 0.070 in column (1) to 0.089 in column (5). These results indicate that GPT and human analysts are complementary, corroborating our results in Table [tab:tab3].

4.3.0.3 Does GPT Do Well When Humans Struggle?

To explore the relative advantage of an LLM compared to human analysts, we examine instances when human analysts are likely to struggle with accurately forecasting earnings. In particular, we identify instances where analyst forecasts are likely to be biased or inefficient ex ante. We also consider instances in which analysts tend to disagree about future earnings (exhibit dispersion).

To estimate ex-ante bias (inefficiency) in analysts’ forecasts, we run cross-sectional regressions of analyst forecast errors on the same firm characteristics as in Equation 1 . We then take the absolute value of the fitted values from this regression.²² Consistent with prior literature, forecast errors are defined as the difference between actual EPS and forecasted EPS, scaled by the stock price at the end of the last fiscal year. In addition to ex-ante bias, we measure the disagreement in analysts’ forecasts. Specifically, we use the standard deviation of analysts’ forecasted EPS values, scaled by the stock price at the end of the preceding fiscal year.

We then partition the sample based on the quartile values of analyst bias and estimate Equation 2 for each group. The results are presented in Panel C of Table [tab:tab3]. By comparing the coefficients in columns (1) and (2), we observe important differences. When the analysts’ bias is expected to be relatively low, GPT’s predictions receive a smaller weight (compared to that in column (2) when the bias is expected to be higher), and the coefficient on analysts’ predictions is relatively large. These differences are statistically significant at the 1% level. They suggest that GPT is more valuable in situations when human analysts are likely to be biased. Similar results follow in columns (3) and (4) when we partition the sample on analyst disagreement: GPT’s prediction receives more weight when analysts’ disagreement is high and vice versa.

Taken together, our results indicate that GPT’s forecasts add more value when human biases or inefficiencies are likely to be present.

5 Comparison with Specialized ML Models↩︎

So far, we have shown that GPT’s predictions largely outperform human analysts. As human analysts are known to have a systematic bias in their forecasts, we raise the bar and turn to more sophisticated benchmarks, including state-of-the-art machine learning models.

5.1 Methodology↩︎

Following [13] and [23], we focus on 59 financial variables obtained from the Compustat Annual database to predict future earnings but exclude the price-to-earnings ratio for consistency reasons (stock price is not financial statement information). We perform two different prediction exercises: stepwise logistic regression and ANN. In both cases, we use a rolling five-year training window. That is, we estimate (train) the model using data from years \(t-5\) to \(t-1\), and apply the trained model to the year \(t\) data to generate forecasts. By doing so, we ensure that the models do not learn from the test data during the training phase. Since our sample spans from fiscal year 1962 to 2021, we train 56 distinct models for each prediction method.

In the stepwise logistic regression, we follow [13] and only retain the significant variables from the first step when performing the second step of the procedure. The trained logistic regression then yields a probability value instead of a binary variable as its output. We classify observations with a probability value higher than 0.5 as an increase (and a decrease otherwise). In contrast to the logistic regression, the ANN model allows for non-linearity among the predictors. Our model has an input layer with 59 neurons, two hidden layers with 256 and 64 neurons each, and an output layer with two neurons [52]. The output layer produces a two-dimensional vector \((p_1, p_2)\), and we classify the outcome as an increase when \(p_1>p_2\) and vice versa. We use Adam optimizer, ReLU activation function, and cross-entropy loss. We use batch training with a batch size of 128. All input variables are standardized. Missing continuous variables are imputed as the year-industry average. We apply early stopping criteria with a patience of five epochs, which indicates that the model stops training when there is no improvement in performance for five consecutive epochs.²³ For each training phase, we assign a random 20% of the training sample to the validation set and optimize the learning rate and dropout rate. Specifically, we perform a grid search of nine iterations, using three learning rates (\(1e^{-5}, 1e^{-3},\) and \(1e^{-1}\)) and three dropout rates (0, 0.2, and 0.4).

5.2 Main Results↩︎

5.2.0.1 Overall Results

We report the results in Table [tab:tab4], Panel A, and Figure 6. Stepwise logistic regressions following [13] achieve an accuracy of 52.94% and an F1 score of 57.23%. We observe a considerably higher prediction accuracy using the ANN model. The model achieves a 60.45% accuracy and an F1-score of 61.62%. This result highlights the importance of non-linearities and interactions among financial variables for the predictive ability of numerical information.

Consistent with the results in the analyst sample, our CoT-based GPT predictions achieve an accuracy of 60.31%, which is on par with the specialized ANN model. In fact, in terms of the F1-score, GPT achieves a value of 63.45%, which is the highest among all prediction methods. This indicates a remarkable aptitude of GPT to analyze financial statements.²⁴ Not only does it outperform human analysts, but it generates performance on par with the narrowly specialized state-of-the-art ML applications.

We further examine the possibility that ANN versus GPT performance is partly driven by the slightly different input variables: we use balance sheet and income statement variables for GPT, but 59 [13] ratios for ANN. Thus, to ensure that the results are not an artifact of this choice, we also train an ANN model using the same balance sheet and income statement variables. We scale balance sheet items by total assets and income statement items by total sales. We also include change in revenue, change in lagged revenue, change in total assets, and revenue scaled by total assets. The ANN model with financial statement information achieves an accuracy of 60.12% and an F1-score of 61.30%, which are slightly lower than those of GPT.

5.2.0.2 Time Trends

We report the overall time trend of GPT’s and ANN’s prediction accuracy in Figure 7 (detailed annual accuracy and F1-scores are reported in Appendix A). The left panel shows a negative time trend in GPT’s prediction accuracy. In terms of the economic magnitude, GPT’s accuracy has decreased, on average, by 0.1% point per year, which translates into a decrease in accuracy by 5.4 percentage points over the 54-year sample period. Interestingly, we observe sharp drops in prediction accuracy in 1974, 2008-2009, and 2020. These periods overlap with international macroeconomic downturns: the oil shock in 1974, the financial crisis in 2008-09, and the Covid-19 outbreak in 2020. This result is comforting as GPT should not foresee unexpected, exogenous macroeconomic shocks if its performance is unrelated to memory.²⁵ Most importantly, in the right panel of Figure 7, we plot the time-series trend of the “difference" in the accuracy of GPT and ANN models. ANN models exhibit similar time trends compared to GPT with their annual differences fluctuating close to zero. Thus, for both evaluation metrics, we find a negative and statistically significant time trend, implying that it has become increasingly difficult to predict future earnings using only numeric information.²⁶

5.2.0.3 Sources of Inaccuracy

Next, we explore which firm characteristics are associated with the likelihood of making incorrect earnings predictions. Column (1) of Table [tab:tab4] focuses on the accuracy of GPT’s predictions and is consistent with our findings for the analyst sample (Table [tab:tab3]). We then report the determinants of the incorrect predictions by ANN and logistic regression models in columns (2) and (3), respectively. Both the ANN and logistic regression are also more likely to generate inaccurate predictions when firms are smaller, have higher leverage, record a loss, and have higher earnings volatility. However, interestingly, ANN is relatively more likely than GPT to make inaccurate predictions when firms are smaller and record a loss. A one standard deviation decrease in firm size reduces GPT’s prediction accuracy by 3.4 percentage points. In contrast, the same change in firm size is associated with a 5.5 percentage point decrease in prediction accuracy for the ANN model. The difference between the two coefficients is statistically significant at the 1% level. Similarly, the coefficients on Loss and Earnings Volatility are statistically different at the 5% level. The differences between logistic regression and GPT predictions are even more pronounced. These findings hint at the ability of GPT to make better predictions for less common data patterns (e.g., loss-making firms), presumably due to its ability to rely on its conceptual knowledge and theoretical understanding of business.

5.2.0.4 Incremental Informativeness

While GPT’s performance is comparable to that of an ANN, we also examine whether GPT conveys incremental information when compared to specialized ML models. This analysis is reported in Panel C. In columns (1) to (3), we show that across all models, predicted earnings changes, individually, are positively associated with the actual changes. In column (4), when both GPT and ANN forecasts are included simultaneously, both remain statistically significant and hence contain incremental information. Interestingly, the coefficient on ANN becomes one-third in magnitude (compared to column (2)) and its statistical significance deteriorates (from a t-statistic of 3.69 to 2.36), whereas the coefficient on GPT remains stable. This result suggests that GPT captures some additional dimensions of information than non-linear interactions among financial variables when predicting future earnings, e.g., external theoretical knowledge.

5.3 Confidence, Magnitude, and Generalizability↩︎

5.3.1 LLM’s Confidence↩︎

5.3.1.1 Method

We estimate the confidence of LLM’s answers based on two methods. First, we explicitly instruct the model to report a confidence score on its earnings prediction, with one being perfect confidence and zero being a pure guess [2]. Second, we compute an alternative confidence score based on token-level logistic probability values, which we directly take from the probability vector provided by the model. Specifically, we average the logistic probability values across all output tokens to measure the overall certainty of the model answer.

5.3.1.2 Results

For both approaches, we report prediction results of the high confidence (fourth quartile) and the low confidence (first quartile) groups. We present the results in Figure 8 and columns (1) to (4) of Table [tab:tab5]. The model performs better when it reports greater confidence. In the high confidence group, the model achieves an average accuracy of 62.44% (63.15%) based on the reported confidence value (confidence score derived from logistic probabilities), which is approximately 2.6 (4.6) percentage points higher than the corresponding accuracy of the low confidence group. We find similar results based on the F1 score. Overall, this result indicates that the model is capable of distinguishing between instances where earnings are more predictable.

5.3.2 Magnitude↩︎

5.3.2.1 Method

Recall that we also instruct the model to provide the expected magnitude of earnings change: “large",”moderate", or “small." As in [13] and [23], we expect the model to be more accurate in determining the directional change when it predicts large rather than immaterial changes.

5.3.2.2 Results

We present the results in Figure 8 and columns (5) and (6) of Table [tab:tab5]. We find that the average accuracy is 62.03% when the model predicts large changes whereas it decreases to 60.22% for small changes. We document a similar pattern for F1 scores: 61.16% for large changes vs. 57.95% for small changes. Overall, when the model expects a larger change, its directional predictions are more accurate.

5.3.3 LLM type↩︎

5.3.3.1 Method

We also test whether the capabilities associated with a specific LLM type determine its predictive ability. In the main analysis, we use the most recent version of GPT, GPT-4-turbo. We also experimented with a less powerful LLM version from the same family, GPT-3.5-turbo, and otherwise used the same experimental settings. In addition, we also explored another family of LLMs provided by Google, namely, Gemini Pro 1.5 (also with the same experimental settings). Due to considerable processing time, we choose a random 20% sample for this set of analyses.

5.3.3.2 Results

We present the results in Figure 8 and Table [tab:tab5], columns (7) to (9). GPT 4 achieves the best performance, followed by Gemini 1.5, and GPT 3.5. Gemini 1.5 achieves an overall accuracy of 59.15%, which is close to that of GPT 4 (61.05%) in the same 20% sample. However, GPT 3.5 achieves an accuracy of only 52.29% and an F1-score of 59.17%, which are all substantially lower than our GPT 4 benchmarks. We also find that the outputs of GPT 4 and Gemini 1.5 are largely overlapping with only 1,808 out of 30,135 firm-years (approximately 6%) having opposing predictions. Overall, this analysis suggests that our findings are not confined to a specific family of LLMs. Although the final prediction results largely rely on the performance of the backbone language model, recent generations of LLMs are capable of analyzing financial statements and making informed decisions.

6 Where Does an LLM’s Predictive Ability Come From?↩︎

In this section, we aim to understand the sources of GPT’s predictive ability. We explore two broad explanations. The first explanation is that GPT’s performance comes from its memory, e.g., due to the model’s ability to identify the company based on numeric data. We aim to rule out this possibility as it undermines the integrity of the model’s predictions. Another explanation is that the strength of the model is in its ability to generate narrative insights based on its analysis of numeric data. We explore each of these possibilities next.

6.1 Is There a Look-ahead Bias in the Model?↩︎

An important concern with the reliance on a pre-trained large language model in a prediction task is its potential for a look-ahead bias [54]. For example, the model may have been trained on the company-specific financial data and, hence, already may “know" the answer as to whether earnings increased or decreased in the future (or have a general sense of how well the company did over time). Our research design is relatively immune from this potential bias [55] because we use a consistent anonymized format of financial statements across firms. This makes it virtually impossible for the model to infer a firm’s identity from the structure of financial statements or specific account names. We also ensure the statement does not contain any dates and use relative years, i.e., \(t\) or \(t-1\). This later mitigates the concern that the model has knowledge about macroeconomic trends in a specific year and uses it to predict future earnings. To appreciate this issue, imagine that the model was able to match a given set of financials to 2007. In this case, the model could draw on its knowledge of the major economic downturn in 2008 and adjust its prediction accordingly.

Even though the anonymous nature of financial statements should prevent the model from “guessing" the entity, we perform two formal analyses to further rule out this concern.²⁷

6.1.0.1 Can GPT Guess Firm Name and Year?

In this set of tests, we instruct the model to make guesses about the firm or year based on the financial statements that we provide. Specifically, we ask the model to provide the ten most probable firm names and the most probable fiscal year. Additionally, we force the model to produce outputs even when it believes that it cannot make any informed guess.

For economic reasons, our first set of experiments does not include any chain-of-thought prompts. We perform this experiment on 10,000 random observations. The results are presented in Table [tab:tab6], Panel A. We find that the model correctly identifies the firm name with an accuracy of 0.07%, which is lower than the accuracy of a random guess from the population of names in our data. In Figure 10 left panel, we plot the ten most frequently produced firm names. We find that the model almost always predicts the same set of ten firms, including Tesla, Facebook, and Amazon. This result is consistent with the model’s training objective to produce the most probable words (name in this case) conditional on its information. Absent an informative prior, the model is likely to predict the most visible or popular firms in its training corpus.

The accuracy of correctly guessing the year of financial statements is 2.95%. In the right panel of Figure 10, we plot the actual fiscal year and GPT’s prediction in one plane. We observe that almost all predictions are 2019, 2020, or 2021 independent of the actual year, which is inconsistent with the model’s ability to guess the year.²⁸

In the second set of experiments, we use the exact same chain-of-thought prompts as in the main analysis, but then ask the model to guess the firm name and year (instead of predicting earnings). We use a random sample of 500 observations. Panel B of Table [tab:tab6] contains the results. The findings confirm very low accuracy and thus address a potential concern that the CoT prompt is more capable of invoking the model’s memory. Taken together, our results strongly suggest that the model cannot make a reasonable guess about the entity or the fiscal year from the anonymous financial statements. Therefore, it is highly unlikely that the model is inadvertently using its “memory" about financial information to make earnings predictions.

6.1.0.2 Analysis Outside of GPT’s Training Window

As suggested in [54], the most effective way to rule out the model’s look-ahead bias is to perform a test outside of the model’s training window. OpenAI’s GPT4-Turbo preview was trained on data up to April 2023, thereby significantly limiting the scope to conduct this analysis. Nevertheless, we use financial statement data from fiscal year 2022 (released in January-March 2023) to predict earnings of fiscal year 2023 (released in early 2024).

We present the results in Table [tab:tab6], Panel C. As a comparison, we also report prediction results of the logistic regressions, analyst predictions, and ANN models. GPT achieves an accuracy of 58.96% and an F1 score of 63.91%. The accuracy (but not the F1 score) is slightly lower than the average reported in Table [tab:tab4], Panel A. However, recall that we find an overall decreasing time trend in GPT’s prediction accuracy. Specifically, as shown in Appendix A, GPT’s prediction accuracy is only 54.36% for the fiscal year 2021, and 59.01% for 2019 (GPT’s prediction accuracy plummets in 2020 during Covid-19 outbreak). In fact, both the out-of-GPT-sample accuracy and the F1 score are substantially higher than the average over the last 10 years (58.01% and 59.15%). Therefore, we interpret our results as GPT’s out-of-sample performance being closely in line with our “in-sample" results. Furthermore, GPT achieves a very similar accuracy score out-of-sample as the ANN model (58.96% versus 59.10%) and an even higher F1 score (63.91% versus 61.13%) for the same year, which is closely in line with our main findings. Taken together, this result corroborates our prior tests and confirms that the model’s predictive ability does not stem from its training memory.

6.2 Are LLM-Generated Texts Informative?↩︎

Next, we explore whether the model’s predictive ability comes from its ability to generate narrative insights about the financial health and future performance of the company, in line with its objective to analyze financial statements. We leverage the fact that our CoT prompts instructed the model to provide information besides the prediction itself: narrative description and interpretations of trend and ratio analyses, as well as the rationale behind the binary predictions. We start with descriptive analyses of the generated texts. Subsequently, we evaluate the information content of texts generated by GPT.

6.2.0.1 Descriptive Bigram Analysis

We begin with a descriptive approach, performing a content analysis of the texts generated by the model. This analysis involves counting the most common bigrams in the ratio analysis and the most common monograms (single words) in the rationale section. This method allows us to discern patterns and dominant themes that may contribute to the model’s analytical performance.

We present the results in Figure 10. In the left panel, we report the top ten most frequently used bigrams in the ratio analysis. We calculate the frequency by scaling the bigram counts with the total number of bigrams generated by the model. We find that the model most commonly refers to the operating margin. In addition to the profitability information, the model also frequently computes efficiency (asset and inventory turnover) and liquidity (current ratio, current assets, and current liability). The model’s rationale in making final predictions is generally consistent with its bigram analysis. In its decision, the model commonly refers to firm growth, liquidity, operating profitability, and efficiency. This consistent alignment between the themes identified in the bigram analysis and the model’s final predictions underscores the utility of LLM-generated texts in capturing essential financial indicators.

6.2.0.2 Information Content of Generated Text

We hypothesize that GPT is capable of predicting future earnings because it distills narrative insights about the financial health of the company from the numeric data. We thus examine whether GPT-generated texts contain information that is useful for predicting the direction of future earnings. To do so, we process each GPT output with a BERT-base-uncased model to obtain its 768-dimensional vector representation (note that GPT does not allow retrieving native embeddings, and thus, we use BERT).²⁹ We then design a new ANN model that uses these textual embeddings as inputs and train the ANN to predict the direction of future earnings (target variable). The model has two hidden layers, with dimensions of 256 and 64, and an output layer with two dimensions: probabilities of earnings increase vs. decrease \((p_1, p_2)\). We classify the outcome as an increase when \(p_1>p_2\) and vice versa. The model is otherwise analogous to the ANN models we estimated earlier.³⁰ We refer to this model as the embeddings-based model.

We report the accuracy, F1-score, and the area under the ROC curve (AUC) of the trained model in Table [tab:tab7], Panel B (note that we were not able to measure AUC for GPT forecasts and thus did not report it previously). Our embedding model achieves an accuracy of 58.95%, an F1-score of 65.26%, and an AUC of 64.22%. It is noteworthy that this model achieves the highest F1 score among all classification methods we examined previously. For comparison purposes, the second row of the table repeats the results of the ANN model based on variables from the two financial statements, which was previously reported in Table [tab:tab4]. This model achieves only a somewhat higher accuracy of 60.12%, but a considerably lower F1-score (61.30%) and AUC (59.13%). Overall, our results indicate that narrative text generated by GPT contains a significant amount of information useful in predicting future earnings, i.e., it indeed represents narrative insights derived from numeric data based on the CoT prompt. This result suggests that the narrative insights serve as the basis for GPT’s superior predictive ability. In untabulated results, we find that the correlation between GPT forecasts and the embeddings-based forecasts of future earnings direction have a correlation of 94%, which suggests that both rely largely on the same information set.

As additional analyses, we experiment with different ANN specifications by changing the input vectors. First, following [52], we include both textual vectors (GPT insights) and numeric data (scaled variables from financial statements) into the model, allowing for full non-linear interactions among the two inputs. This model is reported in row (3) of the table. We find that the dual-input model achieves the highest accuracy metrics: accuracy of 63.16%, an F1-score of 66.33%, and an AUC of 65.90%. This result reconciles with our prior evidence that GPT forecasts have incremental information beyond numeric inputs and also highlights the value of considering the narrative insights generated by an LLM when interpreting numerical information.

Finally, we examine the relative importance of different parts of the financial statement analysis performed by GPT. Specifically, the model analyzes trends, then switches to the ratio analysis, and concludes by providing a rationale behind its prediction. We obtain embedding vectors for each of the three types of generated narratives with the goal of assessing their relative importance. Specifically, we estimate three ANN models each of which leaves out one type of embedding vectors from the analysis. The ANN model that omits trend analysis exhibits an accuracy of 57.11%, which is approximately 1.8 percentage points lower than that of the ANN model that uses the entire text embedding. The ANN model, excluding ratio analysis, achieves an accuracy of 55.65%, which is almost 3.3 percentage points lower than that of the full ANN model. These results indicate that ratio and, subsequently, trend analysis add the highest and second highest informational value, respectively, when determining the future direction of the company. In contrast, excluding the rationale narrative does not change the model performance substantially (58.88%), implying that the rationale does not add information beyond the trend and ratio analyses.

7 Asset Pricing Tests↩︎

Having demonstrated that GPT’s predictions of the earnings direction have high accuracy and stem from the model’s ability to generate insights rather from memory, we now investigate the practical value of an LLM-based financial statement analysis by evaluating trading strategies based on GPT’s output.

In particular, signals that are informative about future expected profits should exhibit a positive association with expected stock returns in the cross-section of firms [42]. The asset pricing models typically use the current level of profitability as a proxy for future expected future profitability [56]. To the extent GPT forecasts have incremental information about future profitability, they should also predict future stock returns. We use GPT forecasts of whether earnings are likely to increase or decrease in the subsequent period, to form an investment strategy and evaluate its performance.

7.1 Methodology↩︎

Because our sample includes firms with December 31 fiscal year-end, their financial results are released by the end of March. Following prior literature, we allow approximately three months for the market to fully process the reported information and form portfolios on June 30 of each year. We hold the portfolio for one year and measure their Sharpe ratios and monthly alphas. We compare three types of strategies. The first strategy sorts stocks into portfolios based on GPT forecasts, and the other two perform sorts based on ANN and logistic regression forecasts that rely on numeric information.

7.1.0.1 ANN and Logistic Regressions

ANN and logistic regressions yield probabilities that earnings will increase in the subsequent year. We use these predicted probabilities to sort the stocks into ten portfolios. Then on June 30, each year, we take long positions in the top decile stocks and short stocks in the bottom decile.

7.1.0.2 GPT

Because GPT does not provide probabilities that earnings will increase or decrease, we follow a different approach to form portfolios. We rely on three pieces of information: binary directional prediction, magnitude prediction, and average log probability of tokens. In particular, for each fiscal year, we select stocks predicted to experience an “increase" in earnings with the predicted magnitude (of the change in earnings) of either”moderate" or "large." Then we sort those stocks on the average log probability values associated with the generated text. This allows us to choose stocks with relatively more confident forecasts (recall that model answers with high certainty are more accurate than the ones with low certainty). We then retain stocks with the highest log probabilities such that the number of firms retained each year constitutes 10% of our sample in that year (our goal is to construct an equivalent to a decile portfolio). We also do the same for the stocks predicted to experience a “decrease" in earnings. We filter stocks with a predicted magnitude of either”moderate" or “large", and sort them on log probability values. We then short the same number of stocks as that in the long portfolio, i.e., retain 10% from the total number of observations in that year with the highest expected confidence. By doing so, we match the number of stocks to the number of stocks included in ANN or logit-based portfolios.

7.2 Results↩︎

7.2.0.1 Sharpe Ratios

To compute Sharpe ratios, we form equal-weighted and value-weighted portfolios. For value-weighted portfolios, we rebalance the portfolio weights each month. Although value-weighted portfolios are less sensitive to small market capitalizations, it is difficult to rebalance the portfolios based on the stocks’ time-varying market caps in practice [38]. Recall that our prior findings suggest that GPT appears to have an advantage in analyzing smaller and relatively more volatile companies. We thus present the outcome of both the value- and equal-weighted strategies.

The results are presented in Table [tab:tab8], Panel A. We find that equal-weighted portfolios based on GPT predictions achieve a Sharpe ratio of 3.36, which is substantially larger than the Sharpe ratio of ANN-based portfolios (2.54) or logistic regression-based portfolios (2.05). In contrast, for value-weighted portfolios, we observe that ANN performs relatively better (Sharpe = 1.79) than GPT (1.47). Both dominate the logistic regressions (0.81).³¹ This result is consistent with our finding in Table [tab:tab4] that both GPT and ANN contain incremental information and are thus complementary. Overall, this analysis shows potential for using GPT-based financial statement analysis to derive profitable trading strategies.

7.2.0.2 Alphas

Next, we compute monthly alphas for each of the three investment strategies described above based on five different factor models, from CAPM to [42]’s five factors plus momentum. We present the results in Table [tab:tab8], Panel B.

Consistent with the results in Panel A, equal-weighted portfolios generate higher alphas in general. As expected, we observe a significant reduction in alphas when we include the profitability factor in column (4) (from 1.29 to 0.97 for portfolios based on GPT predictions), which is another proxy for future profitability. However, even after controlling for five factors and momentum, portfolios based on GPT’s predictions generate a monthly alpha of 84 basis points (column (5)), or 10% annually. Portfolios based on ANN and logistic regression estimates also generate positive alphas. However, their magnitudes and economic significance are smaller (60 basis points with a t-statistic of 1.89 for ANN and 43 basis points with a t-statistic of 1.96 for logistic regressions).

In Figure 9, we plot the cumulative log returns of portfolios based on GPT’s predictions from 1968 to 2021. The left panel shows the cumulative log returns for equal-weighted long and short portfolios separately. As expected, the long portfolio substantially outperforms the short portfolio. In the right panel, we plot the cumulative log returns for the long-short portfolio and compare them with the log market portfolio returns (dotted line). Notably, our long-short portfolio consistently outperforms the market portfolio even when the market experiences negative cumulative returns.

For value-weighted portfolios, consistent with Sharpe ratio results, ANN-based portfolios perform better compared to GPT with 50 basis points alphas even after controlling for the five factors and momentum. Portfolios based on GPT’s predictions achieve 37 basis points alpha with a t-statistic of 2.43 (column (10)). Portfolios based on logit estimates also exhibit positive alphas (31 basis points) though they are marginally insignificant (t-statistic = 1.55).

Overall, our analysis demonstrates the value of GPT-based fundamental analysis in stock markets. We also note that the stronger (weaker) GPT’s performance compared to ANN when evaluated on equal-weighted (value-weighed) strategies is intriguing and points to GPT’s ability to uncover value in smaller stocks.

8 Conclusion↩︎

In this paper, we probe the limits of large language models by providing novel evidence on their ability to analyze financial statements. Financial statement analysis is a traditional quantitative task that requires, critical thinking, reasoning, and judgment. Our approach involves providing the model with structured and anonymized financial statements and a sophisticated chain-of-thought prompt that mimics how human analysts process financial information. We specifically do not provide any narrative information.

Our results suggest that GPT’s analysis yields useful insights about the company, which enable the model to outperform professional human analysts in predicting the direction of future earnings. We also document that GPT and human analysts are complementary, rather than substitutes. Specifically, language models have a larger advantage over human analysts when analysts are expected to exhibit bias and disagreement, suggesting that AI models can assist humans better when they are under-performing. Humans, on the other hand add value when additional context, not available to the model is likely to be important.

Furthermore and surprisingly, GPT’s performance is on par (or even better in some cases) with that of the most sophisticated narrowly specialized machine learning models, namely, an ANN trained on earnings prediction tasks. We investigate potential sources of the LLM’s superior predictive power. We first rule out that the model’s performance stems from its memory. Instead, our analysis suggests that the model draws its inference by gleaning useful insights from its analysis of trends and financial ratios and by leveraging its theoretical understanding and economic reasoning. Notably, the narrative financial statement analysis generated by the language model has substantial informational value in its own right. Building on these findings, we also present a profitable trading strategy based on GPT’s predictions. The strategy yields higher Sharpe ratios and alphas than other trading strategies based on ML models. Overall, our analysis suggests that GPT shows a remarkable aptitude for financial statement analysis and achieves state-of-the-art performance without any specialized training.

Although one must interpret our results with caution, we provide evidence consistent with large language models having human-like capabilities in the financial domain. General-purpose language models successfully perform a task that typically requires human expertise and judgment and do so based on data exclusively from the numeric domain. Therefore, our findings indicate the potential for LLMs to democratize financial information processing and should be of interest to investors and regulators. For example, our results suggest that generative AI is not merely a tool that can assist investors [5], but can play a more active role in making informed decisions. This finding is significant, as unsophisticated investors might be prone to ignoring relevant signals [57], even if they are generated by advanced AI tools. However, whether AI can substantially improve human decision-making in financial markets in practice is still to be seen. We leave this question for future research. Finally, even though we strive to understand the sources of model predictions, it is empirically difficult to pinpoint how and why the model performs well.

Appendix A. Time Series of GPT’s Prediction Accuracy↩︎

This table shows time-series prediction accuracy and F1 scores of GPT and ANN. The last two columns are the differences between the two models (GPT - ANN). Time trend is obtained by regressing accuracy metrics on fiscal years, obtaining robust standard errors at the year level. *, **, and *** denote statistical significance at 10%, 5%, and 1% levels, respectively.

Table 1: No caption
	GPT		ANN		Diff
Fiscal Year	Accuracy	F1	Accuracy	F1	Accuracy	F1
1968	58.55%	67.19%	58.48%	67.45%	0.07%	-0.26%
1969	59.23%	59.85%	58.71%	59.32%	0.52%	0.53%
1970	55.51%	58.86%	55.27%	58.66%	0.24%	0.20%
1971	60.29%	70.33%	59.73%	69.89%	0.56%	0.44%
1972	72.96%	81.58%	71.26%	80.62%	1.70%	0.96%
1973	67.53%	74.97%	66.84%	74.60%	0.69%	0.37%
1974	57.32%	63.10%	55.93%	61.93%	1.39%	1.17%
1975	58.29%	67.21%	57.93%	66.89%	0.36%	0.32%
1976	68.31%	77.63%	68.00%	77.42%	0.31%	0.21%
1977	69.14%	78.30%	68.64%	77.96%	0.50%	0.34%
1978	69.84%	78.73%	69.26%	78.27%	0.58%	0.46%
1979	61.90%	67.70%	60.96%	66.84%	0.94%	0.86%
1980	61.98%	64.72%	61.04%	63.97%	0.94%	0.75%
1981	61.00%	58.45%	59.78%	57.37%	1.22%	1.08%
1982	58.89%	58.04%	57.69%	56.75%	1.20%	1.29%
1983	66.32%	71.88%	64.96%	70.93%	1.36%	0.95%
1984	57.82%	59.54%	56.45%	58.14%	1.37%	1.40%
1985	58.19%	53.48%	56.44%	51.33%	1.75%	2.15%
1986	60.45%	60.82%	58.92%	59.48%	1.53%	1.34%
1987	63.85%	69.07%	62.45%	67.90%	1.40%	1.17%
1988	61.01%	63.50%	59.84%	62.32%	1.17%	1.18%
1989	60.36%	59.25%	59.39%	58.30%	0.97%	0.95%
1990	59.16%	57.50%	58.78%	57.08%	0.38%	0.42%
1991	60.04%	58.91%	59.63%	58.29%	0.41%	0.62%
1992	58.64%	61.49%	58.09%	60.75%	0.55%	0.74%
1993	66.79%	70.03%	62.25%	66.47%	4.54%	3.56%
1994	64.58%	69.70%	63.48%	68.69%	1.10%	1.01%
1995	60.84%	65.53%	59.87%	64.13%	0.97%	1.40%
1996	63.96%	67.09%	63.00%	65.88%	0.96%	1.21%
1997	56.38%	56.57%	55.40%	55.04%	0.98%	1.53%
1998	56.24%	58.79%	54.46%	56.38%	1.78%	2.41%
1999	62.39%	64.27%	61.08%	62.82%	1.31%	1.45%
2000	55.27%	51.77%	54.22%	50.66%	1.05%	1.11%
2001	56.65%	54.92%	56.02%	54.38%	0.63%	0.54%
2002	56.51%	62.80%	55.68%	62.24%	0.83%	0.56%
2003	59.94%	66.97%	59.92%	67.22%	0.02%	-0.25%
2004	60.59%	67.68%	60.13%	67.54%	0.46%	0.14%
2005	60.17%	65.35%	59.69%	64.76%	0.48%	0.59%
2006	61.36%	63.24%	60.69%	62.46%	0.67%	0.78%
2007	60.73%	52.21%	60.27%	52.32%	0.46%	-0.11%
2008	51.06%	43.49%	50.62%	43.33%	0.44%	0.16%
2009	48.86%	51.61%	48.57%	51.76%	0.29%	-0.15%
2010	59.28%	66.29%	58.68%	65.98%	0.60%	0.31%
2011	57.52%	62.11%	57.36%	62.02%	0.16%	0.09%
2012	60.09%	63.62%	59.36%	62.49%	0.73%	1.13%
2013	60.81%	63.28%	60.21%	62.49%	0.60%	0.79%
2014	59.54%	59.75%	59.03%	59.04%	0.51%	0.71%
2015	59.76%	60.33%	59.18%	59.09%	0.58%	1.24%
2016	58.49%	59.44%	57.91%	58.77%	0.58%	0.67%
2017	59.35%	64.70%	58.41%	64.15%	0.94%	0.55%
2018	58.40%	60.31%	57.75%	59.52%	0.65%	0.79%
2019	59.01%	49.73%	58.35%	49.50%	0.66%	0.23%
2020	50.25%	55.24%	49.72%	54.41%	0.53%	0.83%
2021	54.36%	55.13%	53.54%	59.84%	0.82%	-4.71%
Trend	-0.001***	-0.002***	-0.001***	-0.002***	-0.000	-0.000
	(-3.17)	(-3.95)	(-3.13)	(-3.71)	(-1.49)	(-1.12)

Appendix B. Example Balance Sheet and Income Statement↩︎

Appendix C. Example Output↩︎

Appendix D. GPT’s Guess About Fiscal Years↩︎

In Table [tab:tab6], we show that the accuracy of GPT’s fiscal year guesses is 2.95%. Our sample spans the period 1968-2021, and one might have concern that a pure random guess leads to a probability of 1.85%, which is lower than GPT’s accuracy. However, given the distribution in GPT’s answers and the distribution in our universe, this is not the case.

We observe that out of 10,000 random samples, GPT’s fiscal year predictions only give years 2001 (0.02%), 2008 (0.47%), 2018 (0.02%), 2019 (3.50%), 2020 (32.60%), 2021 (63.31%), and 2023 (0.09%). GPT’s inability to produce balanced guesses already suggests that it cannot make informed guesses about fiscal years. However, to test this more formally, assume that we randomly draw a sample from the universe and that its fiscal year is \(i\). When \(i\) is not included in 2001, 2008, 2018, 2019, 2020, 2021, and 2023, the probability that GPT will guess the correct year is zero. If \(i\) is 2001, the probability that GPT will guess the correct year is 0.02% and if \(i\) is 2021, the probability increases to 63.31%. Now define \(p_i\) as the probability that GPT will guess the correct fiscal year given year \(i\).

One more thing that we should consider is that our universe is not a balanced panel. Our data is sparse in earlier years and more dense in recent years. Fiscal year 2021, for instance, account for 3.5% of the total observations. Let \(q_i\) be the proportion of fiscal year \(i\) in the entire sample. Then, the expected probability that a random draw from the population leads to a correct guess by GPT is \[\begin{align} Prob = \sum_{i=1968}^{2021} p_i q_i \end{align}\] This value is 3.3%, which is higher than 2.95%, the value we report in Table [tab:tab6].

fig: — Figure 4: **GPT Processing Details**

fig: — Figure 5: **GPT vs. Human Analysts**

fig: — Figure 6: **GPT vs. Machine Learning Models**

fig: — Figure 7: **Time Trend in Prediction Accuracy**

fig: — Figure 8: **Different GPT Specifications**

fig: — Figure 10: **Sources of Prediction**

fig: — Figure 11: **Equal-Weight Portfolio Cumulative Returns Over Time**

This table shows descriptive statistics of the variables used in analyses. Panel A uses the entire universe of Compustat and Panel B uses the intersection between I/B/E/S and Compustat. For Panel B, we require that each observation has at least three analyst forecasts issued. \(Pred\_X\) denotes an indicator variable that equals one when method \(X\) predicts an increase in earnings and zero otherwise. \(Target\) is an indicator that equals one when earnings increase in the next period and zero otherwise. Size is the log of total assets, BtoM is book-to-market ratio, Leverage is total debt over total asset, Earnings Volatility is the standard deviation of earnings over the past five years scaled by total asset, and PP&E is net property, plant, and equipment scaled by total asset.

Table 2: **Descriptive Statistics**
Panel A. Full Sample (1968 – 2021)
	N	Mean	Std	P25	P50	P75
Target	,678
Pred_GPT	,678
Pred_Logit	,678
Pred_ANN	,678
Pred_Random	,678
Size	,830
BtoM	,830
Leverage	,830
Earnings Volatility	,830
PP&E	,830
Panel B. Analyst Sample (1983 – 2021)
	N	Mean	Std	P25	P50	P75
Target	,533
Pred_GPT	,533
Pred_Analyst1m	,533
Pred_Analyst3m	,533
Pred_Analyst6m	,533
Pred_Random	,533
Size	,736
BtoM	,736
Leverage	,736
Earnings Volatility	,736
PP&E	,736

This table reports prediction performance of the random walk model, analysts’ forecast issued one month after previous earnings release (Analyst 1m), three months after previous earnings release (Analyst 3m), and six months after previous earnings release (Analyst 6m). GPT (wihtout CoT) denotes GPT’s predictions without any chain-of-thought prompts. We simply provide the model with structured and anonymous financial statement information. GPT (with CoT) denotes the model with financial statement information and detailed chain-of-thought prompts. Accuracy is the percentage of correct predictions out of total predictions. F1 is the harmonic mean of the precision and recall.

>cl*3>X & Accuracy & F1
Random Walk & 49.11% & 53.02%
Analyst 1m & 52.71% & 54.48%
Analyst 3m & 55.95% & 55.33%
Analyst 6m & 56.68% & 56.85%
GPT (without CoT) & 52.33% & 54.52%
GPT (with CoT) & 60.35% & 60.90%

[!h]
\caption{\textbf{Complementarities Between Human Analysts and GPT}}
  \scriptsize {\noindent *, **, and *** denote statistical significance at 10\%, 5\%, and 1\% levels, respectively.\\
  In Panel A, we investigate the determinants of incorrect predictions. I(Incorrect = 1), which is an indicator that equals one when the model makes incorrect predictions and zero otherwise. Independent variables are defined in Table \ref{tab:tab1}. All continuous variables are winsorized at 1\% and 99\% level. Standard errors are clustered at the industry level.  Column (1) uses GPT for I(Incorrect = 1) and columns (2), (3), and (4) use analysts' predictions. Panel B shows incremental informativeness of each prediction. Both independent and dependent variables are indicators. I(Increase = 1) is an indicator that equals one when actual earnings increase and zero otherwise. All independent variables are also indicators that equal one when respective method predicts an increase in earnings and zero otherwise. Standard errors are clustered at the industry level. In Panel C, we partition the sample based on analyst bias and dispersion. Bias is the forecasted portion of analysts' forecast error and dispersion is the standard deviation of analyst forecasts scaled by stock price at the end of prior fiscal year. Low and High denote first and fourth quartiles, respectively. F-test compares the magnitude of the coefficients on columns (1) and (2), and (3) and (4).}
\vskip 0.5cm
  \scriptsize
  \setlength\tabcolsep{4pt}
    \begin{tabularx}{\textwidth}{>{\hspace{0pt}}l*{9}{>{\centering\arraybackslash}X}}
    \toprule
    \multicolumn{5}{l}{\textbf{Panel A. Determinants}} \\
    \midrule
    Dep Var & \multicolumn{4}{c}{I(Incorrect=1)} \\
    \cmidrule{2-5}      & GPT   & Analyst 1m & Analyst 3m & Analyst 6m \\
          & (1)   & (2)   & (3)   & (4) \\
    \midrule
    \textit{Size} & -0.017*** & -0.008*** & -0.010*** & -0.010*** \\
          & (-5.16) & (-5.72) & (-4.69) & (-4.81) \\
    \textit{BtoM} & -0.022 & -0.016*** & -0.012** & -0.012** \\
          & (-0.99) & (-2.94) & (-2.21) & (-2.35) \\
    \textit{Leverage} & -0.145 & -0.032 & -0.029 & -0.029 \\
          & (-1.50) & (-0.37) & (-1.40) & (-1.36) \\
    \textit{Loss} & 0.193*** & 0.141*** & 0.146*** & 0.145*** \\
          & (4.76) & (7.02) & (6.90) & (6.09) \\
    \textit{Earnings Volatility} & 0.236*** & 0.169*** & 0.160*** & 0.132** \\
          & (2.69) & (4.08) & (3.46) & (2.47) \\
    \textit{PP\&E} & 0.133* & 0.041 & 0.036* & 0.031 \\
          & (1.67) & (1.18) & (1.71) & (1.25) \\
    Year FE & Yes   & Yes   & Yes   & Yes \\
    Industry FE & Yes   & Yes   & Yes   & Yes \\
    Adjusted R2 & 0.08  & 0.027 & 0.032 & 0.029 \\
    N     & 37,736 & 37,736 & 37,736 & 37,736 \\
    \end{tabularx}  \label{tab:tab3}
  \scriptsize
  \setlength\tabcolsep{4pt}
\begin{tabularx}{\textwidth}{>{\hspace{5pt}}l*{7}{>{\centering\arraybackslash}X}}
\toprule
\multicolumn{8}{l}{\textbf{Panel B. Incremental Informativeness}} \\
\midrule
Dep Var & \multicolumn{7}{c}{I(Increase=1)} \\
\cmidrule{2-8}\multicolumn{1}{r}{} & (1)   & (2)   & (3)   & (4)   & (5)   & (6)   & (7) \\
\midrule
GPT   & \multicolumn{1}{c}{0.182***} &       &       &       & \multicolumn{1}{c}{0.170***} & \multicolumn{1}{c}{0.151**} & \multicolumn{1}{c}{0.152**} \\
\multicolumn{1}{r}{} & (2.99) &       &       &       & (2.67) & (2.35) & (2.30) \\
Analyst 1m &       & \multicolumn{1}{c}{0.073***} &       &       & \multicolumn{1}{c}{0.110**} &       &  \\
\multicolumn{1}{r}{} &       & (3.11) &       &       & (2.43) &       &  \\
Analyst 3m &       &       & \multicolumn{1}{c}{0.098***} &       &       & \multicolumn{1}{c}{0.122***} &  \\
\multicolumn{1}{r}{} &       &       & (4.02) &       &       & (3.49) &  \\
Analyst 6m &       &       &       & \multicolumn{1}{c}{0.100***} &       &       & \multicolumn{1}{c}{0.124***} \\
\multicolumn{1}{r}{} &       &       &       & (4.05) &       &       & (3.62) \\
Year FE & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} \\
Industry FE & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} \\
Adjusted R2 & 0.07  & 0.025 & 0.043 & 0.044 & 0.089 & 0.091 & 0.091 \\
N     & 37,736 & 37,736 & 37,736 & 37,736 & 37,736 & 37,736 & 37,736 \\
\end{tabularx}
\begin{tabularx}{\textwidth}{>{\hspace{5pt}}l*{4}{>{\centering\arraybackslash}X}}
\toprule
\multicolumn{5}{l}{\textbf{Panel C. Human Bias and Dispersion}} \\
\midrule
Dep Var & \multicolumn{4}{c}{I(Increase=1)} \\
\cmidrule{2-5}\multicolumn{1}{r}{} & \multicolumn{2}{c}{Bias} & \multicolumn{2}{c}{Dispersion} \\
\cmidrule{2-5}\multicolumn{1}{r}{} & \multicolumn{1}{c}{Low} & \multicolumn{1}{c}{High} & \multicolumn{1}{c}{Low} & \multicolumn{1}{c}{High} \\
\multicolumn{1}{r}{} & (1)   & (2)   & (3)   & (4) \\
\midrule
GPT   & \multicolumn{1}{c}{0.075**} & \multicolumn{1}{c}{0.341***} & \multicolumn{1}{c}{0.118**} & \multicolumn{1}{c}{0.301***} \\
\multicolumn{1}{r}{} & (2.21) & (4.39) & (2.50) & (3.20) \\
Analyst 1m & \multicolumn{1}{c}{0.175***} & \multicolumn{1}{c}{0.093***} & \multicolumn{1}{c}{0.187***} & \multicolumn{1}{c}{0.058**} \\
\multicolumn{1}{r}{} & (8.54) & (3.05) & (6.59) & (2.35) \\
F-Test on GPT & \multicolumn{2}{c}{p-value \textless 0.01} & \multicolumn{2}{c}{p-value \textless 0.01} \\
F-Test on Analyst & \multicolumn{2}{c}{p-value \textless 0.01} & \multicolumn{2}{c}{p-value \textless 0.01} \\
Year FE & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} \\
Industry FE & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} & \multicolumn{1}{c}{Yes} \\
Adjusted R2 & 0.057 & 0.134 & 0.071 & 0.115 \\
N     & 9,410 & 9,396 & 9,448 & 10,093 \\
\bottomrule
\end{tabularx}

, **, and *** denote statistical significance at 10%, 5%, and 1% levels, respectively.
In Panel A, we compare the prediction performance of GPT and quantitative models based on machine learning. Stepwise Logistic follows [13]‘s structure with their 59 financial predictors. ANN ([13] variables) is a three-layer artificial neural network model using the same set of variables as in [13]. ANN (Financial statement variables) is a three-layer artificial neural network model using the same balance sheet and income statement variables as in our GPT variables. We scale balance sheet items by total assets and income statement items by total sales and also include change in revenue, change in lagged revenue, change in total assets, and revenue scaled by total assets. GPT (with CoT) provides the model with financial statement information and detailed chain-of-thought prompts. Accuracy is the percentage of correct predictions out of total predictions. F1 is the harmonic mean of the precision and recall. In Panel B, we investigate the determinants of incorrect predictions. I(Incorrect = 1), which is an indicator that equals one when the model makes incorrect predictions and zero otherwise. Independent variables are defined in Table 2. All continuous variables are winsorized at 1% and 99% level. Standard errors are clustered at the industry level. Column (1) uses GPT for I(Incorrect = 1) and columns (2), (3), and (4) use analysts’ predictions. Panel C shows incremental informativeness of each prediction. Both independent and dependent variables are indicators. I(Increase = 1) is an indicator that equals one when actual earnings increase and zero otherwise. All independent variables are also indicators that equal one when respective method predicts an increase in earnings and zero otherwise. Standard errors are clustered at the industry level.

>cl*2>X <multicolumn,3>Panel A. Other Models & &
& Accuracy & F1
Stepwise Logistic & 52.94% & 57.23%
ANN ([13] variables) & 60.45% & 61.62%
ANN (Financial statement variables) & 60.12% & 61.30%
GPT (with CoT) & 60.31% & 63.45%

>cl*3>X <multicolumn,4>Panel B. Sources of Inaccuracy & & &
Dep Var = & <multicolumn,3>I(Incorrect=1) & &
& GPT & ANN & Stepwise Logistic
& (1) & (2) & (3)
Size & -0.015*** & -0.024*** & -0.029***
& (-9.09) & (-11.33) & (-11.56)
BtoM & 0.001 & 0.002 & 0.002
& (0.38) & (0.73) & (0.69)
Leverage & 0.092*** & 0.085*** & 0.090***
& (6.30) & (5.88) & (6.02)
Loss & 0.134*** & 0.181*** & 0.202***
& (9.64) & (11.35) & (12.96)
Earnings Volatility & 0.040** & 0.062*** & 0.078***
& (2.09) & (6.35) & (8.02)
PP&E & 0.027* & 0.016 & 0.02
& (1.95) & (1.53) & (1.69)
Year FE & Yes & Yes & Yes
Industry FE & Yes & Yes & Yes
Estimation & OLS & OLS & OLS
Adjusted R2 & 0.097 & 0.102 & 0.109
N & 133,830 & 133,830 & 133,830

>cl*5>X <multicolumn,6>Panel C. Incremental Informativeness & & & & &
Dep Var & <multicolumn,5>I(Increase=1) & & & &
<multicolumn,1> & (1) & (2) & (3) & (4) & (5)
GPT & <multicolumn,1>0.181*** & & & <multicolumn,1>0.170*** & <multicolumn,1>0.179***
<multicolumn,1> & (3.43) & & & (2.67) & (3.35)
ANN & & <multicolumn,1>0.150*** & & <multicolumn,1>0.053** &
<multicolumn,1> & & (3.69) & & (2.44) &
Logistic & & & <multicolumn,1>0.088*** & & <multicolumn,1>0.068**
<multicolumn,1> & & & (2.99) & & (2.05)
Year FE & <multicolumn,1>Yes & <multicolumn,1>Yes & <multicolumn,1>Yes & <multicolumn,1>Yes & <multicolumn,1>Yes
Industry FE & <multicolumn,1>Yes & <multicolumn,1>Yes & <multicolumn,1>Yes & <multicolumn,1>Yes & <multicolumn,1>Yes
Adjusted R2 & 0.056 & 0.051 & 0.032 & 0.061 & 0.06
N & 133,830 & 133,830 & 133,830 & 133,830 & 133,830

[!h]
\caption{\textbf{Experimental Variations and GPT's Predictability}}
  \scriptsize {\noindent We compare the predictive performance of the model based on several experimental settings. Conf Score is the confidence score (ranging from 0 to 1) that the model produces. Confidence score measures how certain the model is in its answers. Log Prob is the averaged token-level logistic probabilities. High and Low in columns (1), (2), (3), and (4) denote first and fourth quartiles, respectively. Magnitude is the predicted magnitude in earnings change provided by the model. LLM Version denotes the family of LLM that we use for the experiment. Accuracy is the percentage of correct predictions out of total predictions. F1 is the harmonic mean of the precision and recall. In Panel B, we report the model performance of an ANN model based on text embedding. We use BERT-base-uncased model to extract contextualized embedding representation of the narrative financial statement analysis performed by the model. The input layer as 768 dimensions, two hidden layers have 256 and 64 dimensions each, and the final layer has one dimension. We use ReLU activation function in the first two transitions and sigmoid for the last transition. Batch size is 128. We use Adam optimizer and binary cross-entropy loss. The model is trained on rolling five-year training windows and hyper-parameters (learning rate and dropout) are determined based on a grid-search on the random 20\% of the training sample. ANN with Financial Statement Variables denotes the model in Table \ref{tab:tab4}, Panel A. AUC denotes the area-under-the-curve.}
\vskip 0.5cm
  \footnotesize
    \setlength\tabcolsep{0pt}
\begin{tabularx}{\textwidth}{>{\hspace{5pt}}l*9{>{\centering\arraybackslash}X}}
\toprule
\multicolumn{1}{l}{} & \multicolumn{2}{c}{Conf Score} & \multicolumn{2}{c}{Log Prob} & \multicolumn{2}{c}{Magnitude} & \multicolumn{3}{c}{LLM Version} \\
\cmidrule{2-10}\multicolumn{1}{r}{} & \multicolumn{1}{c}{High} & \multicolumn{1}{c}{Low} & \multicolumn{1}{c}{High} & \multicolumn{1}{c}{Low} & \multicolumn{1}{c}{Large} & \multicolumn{1}{c}{Small} & \multicolumn{1}{c}{GPT4.0} & \multicolumn{1}{c}{GPT3.5} & \multicolumn{1}{c}{Gemini} \\
\multicolumn{1}{r}{} & (1)   & (2)   & (3)   & (4)   & (5)   & (6)   & (7)   & (8)   & (9) \\
\midrule
\multicolumn{1}{l}{\hspace{9pt}Accuracy} & 62.44\% & 59.86\% & 63.15\% & 58.54\% & 62.03\% & 60.22\% & 61.05\% & 52.29\% & 59.15\% \\
\multicolumn{1}{l}{\hspace{9pt}F1}    & 66.47\% & 55.62\% & 65.16\% & 54.15\% & 61.16\% & 57.95\% & 65.82\% & 59.17\% & 62.23\% \\
\bottomrule    
\addlinespace[25pt] 
    \end{tabularx}
  \label{tab:tab5}

In this table, we test GPT’s memory. For Panel A and Panel B, we ask GPT to provide ten most probable names of the company and the most probable fiscal year, based on the standardized and anonymous financial statement information. In Panel A, we do not provide chain-of-thought prompts and in Panel B, we provide the same chain-of-thought prompts as in the main analyses. In Panel C, we repeat the main analyses for fiscal year 2022 to predict 2023 earnings. GPT’s training window terminates in April 2023 and our sample period provides a perfect out-of-sample test. Accuracy is the percentage of correct predictions out of total predictions. F1 is the harmonic mean of the precision and recall

>cl*3>X <multicolumn,3>Panel A. Without Chain-of-Thought & &
& Accuracy & F1 Score
Firm Name & 0.07% & 0.07%
Year & 2.95% & 0.41%
<multicolumn,3>Panel B. Chain-of-Thought & &
& Accuracy & F1 Score
Firm Name & 0.09% & 0.09%
Year & 1.01% & 0.27%
<multicolumn,3>Panel C. Out-of-Sample (Using 2022 data to predict fiscal year 2023) & &
& Accuracy & F1 Score
Logistic Regression & 54.47% & 60.60%
ANN & 59.10% & 61.13%
GPT & 58.96% & 63.91%
Analyst & 53.06% & 58.95%

[!h]
\caption{\textbf{Predictive Ability of GPT-Generated Texts}}
  \scriptsize {\noindent We report the model performance of an ANN model based on text embedding. We use BERT-base-uncased model to extract contextualized embedding representation of the narrative financial statement analysis performed by the model. The input layer as 768 dimensions, two hidden layers have 256 and 64 dimensions each, and the final layer has two dimensions (probability vector). We use ReLU activation function in the first two transitions and sigmoid for the last transition. Batch size is 128. We use Adam optimizer and cross-entropy loss. The model is trained on rolling five-year training windows and hyper-parameters (learning rate and dropout) are determined based on a grid-search on the random 20\% of the training sample. ANN with Financial Statement Variables denotes the model in Table \ref{tab:tab4}, Panel A. ANN with Text and FS variables denotes a model that allows full non-linear interactions among embedding neurons and FS variables. ANN with Adjusted Text Embedding denotes models with adjusted text inputs. GPT produces three main textual outputs - trend, ratio, and rationale. ANN excl. Trend denotes an ANN with an input embedding with only ratio and rationale analyses. ANN excl. Ratio and ANN excl. Rationale are defined likewise. AUC denotes the area-under-the-curve.}
\vskip 0.5cm
  \footnotesize
    \setlength\tabcolsep{0pt}
\begin{tabularx}{\textwidth}{>{\hspace{5pt}}l*3{>{\centering\arraybackslash}X}}
\toprule
                                       & Accuracy & F1 Score & AUC      \\ 
\midrule
ANN with GPT Text Embedding            & 58.95\%  & 65.26\%  & 64.22\%  \\
ANN with Financial Statement Variables & 60.12\%  & 61.30\%  & 59.13\%  \\
ANN with Text
  and FS Variables       & 63.16\%  & 66.33\%  & 65.90\%  \\
ANN with Adjusted
  Text Embedding     & ~        & ~        & ~        \\
~~~ ANN excl. Trend                    & 57.11\%  & 64.03\%  & 63.81\%  \\
~~~ ANN excl. Ratio                    & 55.65\%  & 62.36\%  & 61.89\%  \\
~~~ ANN excl. Rationale                & 58.88\%  & 65.15\%  & 64.16\%  \\
\bottomrule
\end{tabularx}
  \label{tab:tab7}

[!h]
    \caption{\textbf{Asset Pricing Implications}}
  \scriptsize {\noindent *, **, and *** denote statistical significance at 10\%, 5\%, and 1\% levels, respectively.\\
  In this table, we show asset pricing implications of GPT's predictions. We form portfolios on June 30 of each year and hold the portfolios for one year. To form portfolios based on GPT's predictions, for each fiscal year, we choose stocks with a binary prediction of ``increase" and a magnitude prediction of either ``moderate" or "large." Then we sort those stocks on descending average log probability values. From this selected subset of stocks, we long stocks equivalent to 10\% of the entire stocks available in the given fiscal year from those ranked highest in log probability. We also do the same for the stocks with a binary prediction of ``decrease." We filter stocks with a predicted magnitude change of either ``moderate" or "large", and sort them on log probability values. For ANN and logit, we sort the stocks on the predicted probability values of earnings increase. Then on June 30, we long stocks in the top decile and short stocks in the bottom decile. Panel A reports monthly Sharpe ratio. Panel B reports alphas based on CAPM, three-factor, four-factor, five-factor, and six-factor (five factors plus momentum).}
\vskip 0.5cm
    \scriptsize
    \setlength\tabcolsep{0pt}
        \begin{tabularx}{\textwidth}{>{\hspace{5pt}}l*{6}{>{\centering\arraybackslash}X}}
    \toprule
    \multicolumn{7}{l}{\hspace{5pt}\textbf{Panel A. Sharpe Ratios (monthly)}} \\     \midrule
    \multicolumn{1}{r}{} & \multicolumn{3}{c}{Equal-Weighted} & \multicolumn{3}{c}{Value-Weighted} \\
    \cmidrule{2-7}\multicolumn{1}{r}{} & (1)   & (2)   & (3)   & (4)   & (5)   & (6) \\
    \multicolumn{1}{r}{} & \multicolumn{1}{c}{High} & \multicolumn{1}{c}{Low} & \multicolumn{1}{c}{H-L} & \multicolumn{1}{c}{High} & \multicolumn{1}{c}{Low} & \multicolumn{1}{c}{H-L} \\
    \midrule
     & \multicolumn{6}{c}{GPT Predictions} \\
    \midrule
    Ret   & 1.72  & 0.44  & 1.28  & 1.04  & 0.48  & 0.56 \\
    Std   & 0.59  & 0.68  & 0.38  & 0.52  & 0.69  & 0.38 \\
    Sharpe & 2.92  & 0.65  & \cellcolor[rgb]{ .949,  .949,  .949}3.36 & 2.00     & 0.70   & \cellcolor[rgb]{ .949,  .949,  .949}1.47 \\
    \midrule
     & \multicolumn{6}{c}{ANN} \\
    \midrule
    Ret   & 1.40   & 0.51  & 0.89  & 1.11  & 0.59  & 0.52 \\
    Std   & 0.72  & 0.67  & 0.35  & 0.61  & 0.88  & 0.29 \\
    Sharpe & 1.94  & 0.76  & \cellcolor[rgb]{ .949,  .949,  .949}2.54 & 1.82  & 0.67  & \cellcolor[rgb]{ .949,  .949,  .949}1.79 \\
    \midrule
     & \multicolumn{6}{c}{Logit} \\
    \midrule
    Ret   & 1.38  & 0.50   & 0.88  & 1.04  & 0.62  & 0.42 \\
    Std   & 0.61  & 0.65  & 0.43  & 0.55  & 0.77  & 0.52 \\
    Sharpe & 2.26  & 0.77  & \cellcolor[rgb]{ .949,  .949,  .949}2.05 & 1.89  & 0.81  & \cellcolor[rgb]{ .949,  .949,  .949}0.81 \\
\end{tabularx}
\scriptsize
\begin{tabularx}{\textwidth}{>{\hspace{5pt}}l*{10}{>{\centering\arraybackslash}X}}
\toprule
    \multicolumn{11}{l}{\hspace{5pt}\textbf{Panel B. Alphas (monthly)}} \\
\midrule
    & \multicolumn{5}{c}{Equal-Weighted} & \multicolumn{5}{c}{Value-Weighted} \\
\cmidrule{2-11} 
    & CAPM & 3 Factor & 4 Factor & 5 Factor & 5F+Mom & CAPM & 3 Factor & 4 Factor & 5 Factor & 5F+Mom \\
    & (1)   & (2)   & (3)   & (4)   & (5)   & (6)   & (7)   & (8)   & (9)   & (10) \\
\midrule
\multicolumn{11}{c}{GPT} \\
    \midrule
    High & 1.03*** & 1.04*** & 1.05*** & 1.02*** & 1.03*** & 0.48*** & 0.58*** & 0.61*** & 0.52*** & 0.55*** \\
        & (10.47) & (16.28) & (16.21) & (15.67) & (14.78) & (5.93) & (7.82) & (8.08) & (6.78) & (7.35) \\
    Low & -0.20  & -0.29** & -0.24** & 0.05  & 0.19* & -0.23 & -0.42*** & -0.28* & -0.04 & 0.18 \\
        & (-1.24) & (-2.46) & (-2.03) & (0.40) & (1.65) & (-1.45) & (-2.91) & (-1.96) & (-0.27) & (1.34) \\
    \rowcolor[rgb]{ .949,  .949,  .949} H – L & 1.23*** & 1.33*** & 1.29*** & 0.97*** & 0.84*** & 0.71*** & 1.00*** & 0.89*** & 0.56*** & 0.37** \\
    \rowcolor[rgb]{ .949,  .949,  .949}   & (8.96) &(10.48) &(10.10) & (3.14) & (4.48) & (3.81) & (6.27) & (5.57) & (3.78) &(2.43) \\
\midrule
\multicolumn{11}{c}{ANN} \\
\midrule
    High & 0.98*** & 1.00*** & 0.99*** & 0.85*** & 0.82*** & 0.55*** & 0.59*** & 0.65*** & 0.56*** & 0.52*** \\
          & (8.35) & (14.23) & (11.32) & (10.06) & (8.55) & (6.30) & (8.16) & (9.34) & (6.95) & (6.88) \\
    Low & -0.13 & -0.23*** & 0.02  & 0.16  & 0.22  & -0.35* & -0.49*** & -0.23 & -0.16 & 0.02 \\
          & (-1.16) & (-2.99) & (0.32) & (0.62) & (1.53) & (-1.73) & (-3.19) & (-1.16) & (-0.89) & (0.66) \\
    \rowcolor[rgb]{ .949,  .949,  .949} H – L & 1.11*** & 1.23*** & 0.97*** & 0.69** & 0.60* & 0.90*** & 1.08*** & 0.88*** & 0.72*** & 0.50*** \\
    \rowcolor[rgb]{ .949,  .949,  .949}    &  (7.62) & (11.32) & (9.38) & (2.15) & (1.89) & (4.23) & (7.99) & (6.00) & (4.56) & (3.19) \\
\midrule

\multicolumn{11}{c}{Logit} \\

\midrule
    High &0.89*** &0.90*** &0.86*** &0.71*** &0.68*** &0.40*** &0.44*** &0.46*** &0.36*** &0.33** \\
          & (7.15) & (9.11) & (9.25) & (6.11) & (4.23) & (4.15) & (4.22) & (4.44) & (2.86) & (2.15) \\
    Low & -0.18 &-0.26* & -0.05 & 0.23  & 0.25  & -0.24 &-0.36** &-0.29* & -0.11 & 0.02 \\
          &(-1.23) &(-1.95) &(-0.26) & (1.06) & (1.10) &(-1.55) &(-2.26) &(-1.77) &(-0.95) & (0.05) \\
    \rowcolor[rgb]{ .949,  .949,  .949} H – L &1.07*** &1.16*** &0.91*** &0.48* &0.43* &0.64** &0.80** &0.75** & 0.47  & 0.31 \\
    \rowcolor[rgb]{ .949,  .949,  .949}    &(6.50) &(8.15) &(7.19) &(2.06) &(1.96) &(2.35) &(2.56) &(2.41) &(1.67) &(1.55) \\
\bottomrule

\end{tabularx}\label{tab:tab8}

References↩︎

[1]

Bernard, D., E. Blankespoor, T. de Kok, and S. Toynbee (2023). Confused readers: A modular measure of business complexity. University of Washington Working Paper.

[2]

Bybee, J. L. (2023). The ghost in the machine: Generating beliefs with large language models. arXiv preprint arXiv:2305.02823.

[3]

Choi, G.-Y. and A. Kim (2023). Economic footprints of tax audits: A generative ai-driven approach. Chicago Booth Research Paper (23-23).

[4]

Kim, A., M. Muhn, and V. Nikolaev (2023a). From transcripts to insights: Uncovering corporate risks using generative ai. arXiv preprint arXiv:2310.17721.

[5]

Kim, A., M. Muhn, and V. V. Nikolaev (2023b). Bloated disclosures: can chatgpt help investors process information? Chicago Booth Research Paper (23-07).

[6]

Lopez-Lira, A. and Y. Tang (2023). Can chatgpt forecast stock price movements? return predictability and large language models. arXiv preprint arXiv:2304.07619.

[7]

Brown, T., B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901.

[8]

Choi, J. H., K. E. Hickman, A. B. Monahan, and D. Schwarcz (2022). Chatgpt goes to law school. Journal of Legal Education 71, 387.

[9]

Eulerich, M., A. Sanatizadeh, H. Vakilzadeh, and D. A. Wood (2023). Can artificial intelligence pass accounting certification exams? chatgpt: Cpa, cma, cia, and ea? ChatGPT: CPA, CMA, CIA, and EA.

[10]

Penman, S. H. and T. Sougiannis (1998). A comparison of dividend, cash flow, and earnings approaches to equity valuation. Contemporary Accounting Research 15(3), 343–383.

[11]

Penman, S. H. (2001). On comparing cash flow and accrual accounting models for use in equity valuation: A response to lundholm and o’keefe (car, summer 2001). Contemporary Accounting Research 18(4), 681–692.

[12]

Monahan, S. J. et al. (2018). Financial statement analysis and earnings forecasting. Foundations and Trends in Accounting 12(2), 105–215.

[13]

Ou, J. A. and S. H. Penman (1989). Financial statement analysis and the prediction of stock returns. Journal of Accounting and Economics 11(4), 295–329.

[14]

Chen, X., Y. H. Cho, Y. Dou, and B. Lev (2022). Predicting future earnings changes using machine learning and detailed financial data. Journal of Accounting Research 60(2), 467–515.

[15]

Kahneman, D. (2011). Thinking, fast and slow. Macmillan.

[16]

Wei, J., X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837.

[17]

Bouwman, M. J., P. A. Frishkoff, and P. Frishkoff (1987). How do financial analysts make decisions? a process model of the investment screening decision. Accounting, Organizations and Society 12(1), 1–29.

[18]

Bradshaw, M. T., M. S. Drake, J. N. Myers, and L. A. Myers (2012). A re-examination of analysts’ superiority over time-series forecasts of annual earnings. Review of Accounting Studies 17, 944–968.

[19]

Costello, A. M., A. K. Down, and M. N. Mehta (2020). Machine+ man: A field experiment on the role of discretion in augmenting ai-based lending models. Journal of Accounting and Economics 70(2-3), 101360.

[20]

Liu, M. (2022). Assessing human information processing in lending decisions: A machine learning approach. Journal of Accounting Research 60(2), 607–651.

[21]

Abarbanell, J. S. and V. L. Bernard (1992). Tests of analysts’ overreaction/underreaction to earnings information as an explanation for anomalous stock price behavior. The Journal of Finance 47(3), 1181–1207.

[22]

Basu, S. and S. Markov (2004). Loss function assumptions in rational expectations tests on financial analysts’ earnings forecasts. Journal of Accounting and Economics 38, 171–203.

[23]

Hunt, J. O., J. N. Myers, and L. A. Myers (2022). Improving earnings predictions and abnormal returns with machine learning. Accounting Horizons 36(1), 131–149.

[24]

Bochkay, K. and C. B. Levine (2019). Using md&a to improve earnings forecasts. Journal of Accounting, Auditing & Finance 34(3), 458–482.

[25]

Cao, S., W. Jiang, J. L. Wang, and B. Yang (2024). From man vs. machine to man+ machine: The art and ai of stock analyses. Journal of Financial Economics Forthcoming.

[26]

Graham, B. and D. L. Dodd (1934). Security Analysis. New York: Whittlesey House, McGraw-Hill Book Co.

[27]

Piotroski, J. D. (2000). Value investing: The use of historical financial statement information to separate winners from losers. Journal of Accounting Research, 1–41.

[28]

Sloan, R. G. (2019). Fundamental analysis redux. The Accounting Review 94(2), 363–377.

[29]

Stickel, S. E. (1991). Common stock returns surrounding earnings forecast revisions: More puzzling evidence. The Accounting Review, 402–416.

[30]

Brown, L. D., A. C. Call, M. B. Clement, and N. Y. Sharp (2015). Inside the “black box” of sell-side financial analysts. Journal of Accounting Research 53(1), 1–47.

[31]

Groysberg, B., P. M. Healy, and D. A. Maber (2011). What drives sell-side analyst compensation at high-status investment banks? Journal of Accounting Research 49(4), 969–1000.

[32]

Bradshaw, M. T. (2011). Analysts’ forecasts: what do we know after decades of work? Working Paper.

[33]

Kothari, S., E. So, and R. Verdi (2016). Analysts’ forecasts and asset pricing: A survey. Annual Review of Financial Economics 8(Volume 8, 2016), 197–219.

[34]

Green, J., J. R. Hand, and X. F. Zhang (2016). Errors and questionable judgments in analysts’ dcf models. Review of Accounting Studies 21, 596–632.

[35]

De Bondt, W. F. and R. H. Thaler (1990). Do security analysts overreact? American Economic Review, 52–57.

[36]

Bordalo, P., N. Gennaioli, R. Porta, and A. Shleifer (2019). Diagnostic expectations and stock returns. The Journal of Finance 74(6), 2839–2874.

[37]

Achiam, J., S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

[38]

Jiang, J., B. T. Kelly, and D. Xiu (2022). Expected returns and large language models. Available at SSRN.

[39]

Hansen, A. L. and S. Kazinnik (2023). Can chatgpt decipher fedspeak? Available at SSRN.

[40]

Dell’Acqua, F., E. McFowland, E. R. Mollick, H. Lifshitz-Assaf, K. Kellogg, S. Rajendran, L. Krayer, F. Candelon, and K. R. Lakhani (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of ai on knowledge worker productivity and quality. HBS Working Paper (24-013).

[41]

Li, E. X., Z. Tu, and D. Zhou (2023). The promise and peril of generative ai: Evidence from chatgpt as sell-side analysts. Available at SSRN 4480947.

[42]

Fama, E. F. and K. R. French (2015). A five-factor asset pricing model. Journal of Financial Economics 116(1), 1–22.

[43]

Ball, R., G. Sadka, and A. Tseng (2022). Using accounting earnings and aggregate economic indicators to estimate firm-level systematic risk. Review of Accounting Studies 27(2), 607–646.

[44]

Ball, R., S. Kothari, and A. Robin (2000). The effect of international institutional factors on properties of accounting earnings. Journal of Accounting and Economics 29(1), 1–51.

[45]

Ball, R. and G. Foster (1982). Corporate financial reporting: A methodological review of empirical research. Journal of Accounting Research, 161–234.

[46]

Ge, W. and S. McVay (2005). The disclosure of material weaknesses in internal control after the sarbanes-oxley act. Accounting Horizons 19(3), 137–158.

[47]

Ashbaugh-Skaife, H., D. W. Collins, and W. R. Kinney Jr (2007). The discovery and reporting of internal control deficiencies prior to sox-mandated audits. Journal of Accounting and Economics 44(1-2), 166–192.

[48]

Watts, R. L. and J. L. Zimmerman (1986). Positive accounting theory.

[49]

Hwang, L.-S., C.-L. Jan, and S. Basu (1996). Loss firms and analysts’ earnings forecast errors. The Journal of Financial Statement Analysis 1(2).

[50]

Hutton, A. P., L. F. Lee, and S. Z. Shu (2012). Do managers always know better? the relative accuracy of management and analyst forecasts. Journal of Accounting Research 50(5), 1217–1244.

[51]

Donelson, D. C. and R. J. Resutek (2015). The predictive qualities of earnings volatility and earnings uncertainty. Review of Accounting Studies 20, 470–500.

[52]

Kim, A. and V. V. Nikolaev (2023a). Context-based interpretation of financial information. Chicago Booth Research Paper (23-08).

[53]

Kim, A. G. and V. V. Nikolaev (2023b). Profitability context and the cross-section of stock returns. Chicago Booth Research Paper (23-11).

[54]

Sarkar, S. K. and K. Vafa (2024). Lookahead bias in pretrained language models. Available at SSRN.

[55]

Glasserman, P. and C. Lin (2023). Assessing look-ahead bias in stock return predictions generated by gpt sentiment analysis. arXiv preprint arXiv:2309.17322.

[56]

Novy-Marx, R. (2013). The other side of value: The gross profitability premium. Journal of financial economics 108(1), 1–28.

[57]

Blankespoor, E., E. Dehaan, J. Wertz, and C. Zhu (2019). Why do individual investors disregard accounting information? the roles of information awareness and acquisition costs. Journal of Accounting Research 57(1), 53–84.

The University of Chicago, Booth School of Business, alex.kim@chicagobooth.edu↩︎
The University of Chicago, Booth School of Business, maximilian.muhn@chicagobooth.edu↩︎
The University of Chicago, Booth School of Business, valeri.nikolaev@chicagobooth.edu↩︎
We appreciate insightful comments from Bok Baik, Mark Bradshaw, Yiwei Dou, Joachim Gassen, Seung-Yeob Han (discussant), Kalash Jain, Ralph Koijen, Laurence van Lent, Christian Leuz, and Sanjog Misra, Thorsten Sellhorn, Stephen Penman, and workshop participants at the Bernstein Quantitative Finance Conference, Balyasny Asset Management Group Finance Seminar, JNE Partners Seminar, Bloomberg Quant Seminar, LMU Munich, University of Chicago, University of North Carolina at Chapel Hill, Tuck Accounting Spring Camp at Dartmouth, and Korean-American Accounting Professors’ Association. Yijing Zhang provided excellent research assistance. The authors gratefully acknowledge financial support from the University of Chicago Research Support Center, Fama-Miller Center for Finance Research, and the Stevens Doctoral Program at the University of Chicago Booth School of Business.↩︎
For example, to be able to efficiently summarize texts, an LLM is trained on a large corpus of documents that involve summaries typically generated by humans↩︎
Focusing on predicting the direction of future earnings provides a specific and measurable objective, facilitating the benchmarking of the model’s performance. It is also consistent with early and more recent literature on this topic [13], [14]. Additionally, the focus on a binary variable is also motivated by the notion that most key decisions performed by humans are binary in nature [15].↩︎
Chain-of-thought prompts are known to enhance the model’s problem-solving capability and induce human-like reasoning [16].↩︎
Since the quantitative models use only financial statement variables, we thus align the timing of human forecasts with the timing of AI-based forecasts.↩︎
This finding is consistent with [18], who show that analysts are superior in predicting one-year ahead earnings.↩︎
We exclude predictors that rely on stock prices, in particular the P/E ratio, because balance sheets and income statements do not contain stock price information. This exclusion ensures comparability of our benchmark. The results are qualitatively similar, however, if we include this variable.↩︎
We use five years to allow the model’s parameters to change over time, which helps to ensure accuracy. We also experimented with longer windows and found similar results↩︎
This finding aligns with our earlier result that GPT is relatively better in predicting earnings for smaller companies compared to ANNs. Given that GPT’s training dataset likely contained a disproportionate amount of information from larger firms, this result further challenges the notion that GPT’s performance is merely a function of its memory.↩︎
The observed variation in prediction accuracy relative to [14] can be attributed to the considerably fewer predictive variables included in our sample. Additionally, when our analysis is confined to firms examined in [14], the prediction accuracy of GPT notably increases to 64%.↩︎
Analysts are often formally trained in financial statement analysis. For example, financial statement analysis is a major part of the Level I CFA exam.↩︎
These more complex quantitative tasks have been traditionally seen as outside of the LLM’s “technological frontier" [40]. Consistent with this argument, [41] processes earnings press releases and finds that GPT performs worse in predicting earnings relative to sell-side analysts.↩︎
Over time, methods for predicting earnings have progressively advanced within the accounting literature. [13] predict earnings changes using a stepwise logistic regression model that uses approximately 60 accounting variables as input. Most recently, [14] use 13,881 in-line XBRL tags and tree-based machine learning models to predict future earnings.↩︎
In particular, we simply present a standardized and anonymous balance sheet and income statement and ask the model to predict whether earnings will increase or decrease in the subsequent period.↩︎
Focusing on December 31 firms allows for more straight-forward asset pricing tests in Section 7 and is consistent with [13], [23].↩︎
Importantly, we do not train or fine-tune the LLM model on the financial statements. The model observes only a single balance sheet and income statement at a time, as provided in Appendix B.↩︎
GPT outperforms human analysts in terms of accuracy under 5% statistical significance.↩︎
Due to high fixed costs of maintaining adequate internal controls, small firms may have lower-quality accounting earnings [45], [46] and are more likely to restate their earnings in subsequent periods [47]. High leverage ratios are often indicative of firms being closer to debt covenant violations. Such firms might be more incentivized to engage in earnings management to meet or beat financial thresholds, leading to lower-quality earnings [48]. Also, when firms experience unusual financial circumstances such as reporting losses, analysts tend to perform worse than average [49], [50]. Lastly, [51] document that past volatility of earnings is negatively associated with its predictive power. Considering that GPT only uses numerical financial information as its input, these results align well with [52], [53] that contextual information becomes relatively more important when firms experience losses and their size is small.↩︎
Note that errors should be unpredictable if forecasts are unbiased and efficient.↩︎
The maximum allowed training epochs is set to 50 yet none of the models hit this limit.↩︎
GPT outperforms stepwise logistic predictions at 1% level. However, the difference between GPT and ANN performance is not statistically significant at conventional levels.↩︎
We discuss this potential issue more formally in Section 6.1.↩︎
This result corroborates [52], who find that the informational value of narrative context in predicting future earnings has increased over time.↩︎
Compared to de-identified financial statement data, anonymizing textual data is conceptually more challenging. Textual data, such as earnings calls, may still retain sufficient contextual information that potentially allows the model to guess the anonymized firm.↩︎
Refer to Appendix D for computing the accuracy of a random guess.↩︎
We use the last hidden stage vector of the CLS token associated with a given narrative. In case the narrative exceeds 512 tokens, we partition the text into chunks and take the average over chunk-specific vectors.↩︎
ReLU activation function is used for the first two layers and the sigmoid function is used in the last layer. We minimize cross-entropy loss and use the Adam optimizer. As in our main ANN model, we use rolling five-year windows to train the model. We use a batch size of 128 and the model stops training when there is no improvement for five consecutive training epochs. We perform a grid search of nine iterations, using three values of learning rates (\(1e^{-5}, 1e^{-3},\) and \(1e^{-1}\)) and three values of dropout rates (0, 0.2, and 0.4), on random 20% of the training sample.↩︎
Accounting for 10 basis points in transaction costs, the GPT-based equal-weighted portfolio yields a Sharpe ratio of 2.84. The value-weighted portfolio yields a Sharpe ratio of 0.95.↩︎

Financial Statement Analysis with Large Language Models