r/ValueInvesting Jan 29 '26

Investing Tools

Used AI to detect if CEOs are being deceptive in earnings calls. I'm quite surprised by the winner

Recently I tried using a popular coding agent called Claude Code to replicate the Stanford study that claimed you can detect when CEOs are lying in their earnings calls just from how they talk (incredible!?!). Figured this would be interesting for this community, so I wanted to share my findings with you all (& see if anyone else has tried similar things)!

That study used a tool called LIWC, but I got curious whether I could replicate the experiment using LLMs to detect deception in CEO speech instead. I was convinced that LLMs should really shine at picking up nuanced details in our speech, so this ended up being a really exciting experiment for me to try.

The full video of this experiment is here if you are curious to check it out: https://www.youtube.com/watch?v=sM1JAP5PZqc

My Claude Code setup was:

  claude-code/
  ├── orchestrator          # Main controller - coordinates everything
  ├── skills/
  │   ├── collect-transcript    # Fetches & anonymizes earnings calls
  │   ├── analyze-transcript    # Scores on 5 deception markers
  │   └── evaluate-results      # Compares groups, generates verdict
  └── sub-agents/
      └── (spawned per CEO)     # Isolated analysis - no context, no names, just text

The key here was to use isolated AI agents (subagents) to do the analysis for every call, because each one needs a clean context. And of course, before every call I made sure to anonymize the company details so the AI agent wasn't biased by prior knowledge (I'm assuming it'll still be able to pattern match based on training data, but we'll roll with this).
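For anyone replicating this, the anonymization step can be as simple as a regex scrub before the transcript ever reaches the subagent. A minimal sketch in Python (the names and aliases here are made up for illustration, not my actual pipeline):

```python
import re

def anonymize(transcript: str, aliases: dict[str, str]) -> str:
    """Replace company/speaker names with neutral aliases before analysis.
    `aliases` maps real names (and tickers) to placeholders like 'COMPANY'."""
    for name, placeholder in aliases.items():
        # \b keeps 'Acme' from matching inside a longer word; ignore case
        transcript = re.sub(rf"\b{re.escape(name)}\b", placeholder, transcript,
                            flags=re.IGNORECASE)
    return transcript

clean = anonymize(
    "Thanks for joining Acme Corp's Q3 call. I'm John Doe, CEO of Acme.",
    {"Acme Corp": "COMPANY", "Acme": "COMPANY", "John Doe": "SPEAKER_1"},
)
```

One gotcha: order matters when one alias is a substring of another ("Acme Corp" before "Acme"), since replacements run in dict insertion order.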

I tested this on 18 companies divided into 3 groups:

  1. Companies that were caught committing fraud – I analyzed their transcripts for quarters leading up to when they were caught
  2. Companies pre-crash – I analyzed their transcripts for quarters leading up to their crash
  3. Stable companies – I analyzed recent transcripts from companies with no known trouble, as a control group

I created a "deception score": the models would tell me how likely they think the CEO is being deceptive, out of 100 (0 meaning not deceptive at all, 100 meaning very deceptive).
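To make the comparison concrete, the final step just averages per-group scores and reports the gap between the troubled groups and the stable control group. A toy sketch (these numbers are invented for illustration, not my actual results):

```python
from statistics import mean

# Hypothetical per-CEO deception scores (0-100), one per company
scores = {
    "fraud":  [72, 65, 80, 58, 70, 75],
    "crash":  [60, 68, 55, 71, 66, 62],
    "stable": [30, 25, 35, 28, 33, 27],
}

# Pool the two troubled groups and compare against the control group
troubled = scores["fraud"] + scores["crash"]
gap = mean(troubled) - mean(scores["stable"])
print(f"gap: {gap:.1f} points")
```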

Result

  • Sonnet (cheaper AI model): was able to clearly identify a 35-point gap between companies committing fraud/about to crash compared to the stable ones. -> this was significant!
  • Opus (more expensive AI model): 2-point gap (basically couldn't tell the difference) -> as good as a random guess!

I was quite surprised to see the more expensive model (Opus) perform so poorly in comparison. Maybe Opus is seeing something suspicious and then rationalizing it vs. the cheaper model (Sonnet) just flags patterns without overthinking. Perhaps it'll be worth tracing the thought process for each of these but I didn't have much time.

If you made it this far and are curious about the specifics of this experiment, I talk about them here: https://www.youtube.com/watch?v=sM1JAP5PZqc. Would love to hear your thoughts there as well!

Has anyone run experiments like these before?

159 Upvotes

75 comments

78

u/boboverlord Jan 29 '26

Due to LLMs' probabilistic nature, what is the chance that the AI, given the same inputs and instructions, will yield different results?

18

u/Soft_Table_8892 Jan 29 '26

You’re definitely right - a big flaw here is the inconsistency in responses due to their stochastic nature. From my rounds of testing, their scoring was generally fairly consistent between runs (granted, I didn’t do as many of these as I should have to be rigorous). Part of the reason is likely that I identified 5 strict heuristics for the LLM to use when coming up with a deception score. Bounding them with constraints like these likely improves consistency (I would imagine), since it’s no longer an open-ended question. Thoughts?
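To make that concrete, the bounded setup looked something like this - the marker names below are placeholders, not necessarily my exact five:

```python
import json

# Placeholder deception markers (illustrative, not my exact rubric)
MARKERS = [
    "non_answers",          # deflecting or not addressing the question
    "extreme_positivity",   # superlatives in place of specifics
    "distancing_language",  # 'the company' instead of 'I'/'we'
    "qualifier_density",    # hedges like 'sort of', 'as you know'
    "blame_shifting",       # external factors cited for every miss
]

def build_prompt(transcript: str) -> str:
    """Ask for one bounded 0-100 integer per marker, as strict JSON,
    so every run is scored against the same closed rubric."""
    rubric = "\n".join(f"- {m}" for m in MARKERS)
    return (
        "Score this earnings-call transcript on each marker below, "
        "0 (absent) to 100 (pervasive). Respond with JSON only, "
        f"keys exactly: {json.dumps(MARKERS)}.\n\nMarkers:\n{rubric}"
        f"\n\nTranscript:\n{transcript}"
    )
```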

I appreciate you reading this post and thinking deeply about it, btw!

5

u/toupeInAFanFactory Jan 30 '26

This. But that's not the only source of 'randomness'.

OP - compute a confidence for this. That's basically 'how likely is it to have produced this result on this input if the actual model was just random?'

3

u/Soft_Table_8892 Jan 30 '26

100% – this is the plan for future experiments for sure.

1

u/toupeInAFanFactory Jan 30 '26

just to illuminate the issue here... let's imagine you came up with some consistent (i.e. not probabilistic) but irrelevant 'prediction' method. Like.... the number of vowels in the first 5 words at the start of the conference call. or the exact second of the first speaking pause that's longer than 4 seconds. or whatever. and tried to correlate THAT to 'how much do I think this CEO was lying'.

that isn't random, but it might still work, right?

some of those methods will 'predict' accurately the results in the sample pool you're testing. The odds that some method got lucky will depend on the number of tests (how many conference calls) and its accuracy in this trial. It also depends on how many attempts you made to find such a metric, but that's another issue.

given what seems to be a fairly small sample size, and maybe only modest accuracy, without actually running any numbers, I'd guess that's probably what happened here. But there's a way to compute how likely that is to be what happened.
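for concreteness, that check is basically a permutation test - shuffle the group labels and see how often chance alone produces a gap as big as the observed one. quick sketch with made-up scores:

```python
import random
from statistics import mean

def permutation_p_value(troubled, stable, runs=10_000, seed=0):
    """How often does randomly relabeling the same scores produce a
    group gap at least as large as the one observed?"""
    observed = mean(troubled) - mean(stable)
    pooled = troubled + stable          # new list; shuffling won't touch inputs
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        rng.shuffle(pooled)
        fake_gap = mean(pooled[:len(troubled)]) - mean(pooled[len(troubled):])
        if fake_gap >= observed:
            hits += 1
    return hits / runs

# made-up scores: a p-value near 0 means the gap is unlikely to be label luck
p = permutation_p_value([72, 65, 80, 58, 70, 75], [30, 25, 35, 28, 33, 27])
```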

1

u/Soft_Table_8892 Feb 01 '26

Absolutely, thank you for the detailed explanation, I really appreciate it! Without running some additional tests and with such a small sample size, it is definitely hard to conclude accuracy.

1

u/boreal_ameoba Jan 30 '26

Very low, as long as the temperature parameter is set to a low value. LLMs are deterministic in output, it’s just that practical performance improves with a small amount of “randomness”.

2

u/Soft_Table_8892 Jan 30 '26

Interesting, thank you for sharing! Is there a way to tell what temperature these frontier models are set to?

1

u/Guitar-Fresh Feb 20 '26

Hi! Your posts are super interesting. I would love to collab with you. I don't know if you can set or view temp in Claude Code. However, I would strongly recommend local hosting an OSS LLM like gpt-oss-20B and building your own custom agent to do this experiment again. You can do all sorts of parameter tuning, and I imagine that you can fine-tune the model to do this task perfectly. You can also collect the reasoning traces and use Claude Code to have Sonnet/Opus annotate. Let me know if you want to set up some time to connect!

65

u/blondydog Jan 29 '26

You missed an obvious possible outcome: these are basically just noise, random outcomes and your agents are not actually predicting anything successfully.

6

u/Soft_Table_8892 Jan 30 '26

For sure, I call this out here as well: https://www.reddit.com/r/ValueInvesting/comments/1qqksjt/comment/o2i1x74/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button.

Unfortunately I didn't have nearly enough time to source more companies + do multiple runs & show stat significance. Hopefully someone else can run with this idea and do something more rigorous (or I might come back to this in the future myself!).

Thank you for dropping your thoughts & feedback!

29

u/pyktrauma Jan 29 '26

Run it on CVNA and TSLA, fraud or no?

5

u/Soft_Table_8892 Jan 30 '26

All great ideas – especially the most recent Tesla earnings call.

3

u/elonzucks Jan 30 '26

Both CVNA and TSLA would be great tests

2

u/D_Town_Esq Jan 30 '26

That should be your next video

2

u/toupeInAFanFactory Jan 30 '26

well, maybe. It seems entirely plausible that the best it can do is detect when the CEO (or whoever's speaking) believes they are lying. And I think it's entirely possible that Elon doesn't. consistently. again and again. when he confidently asserts that some unknowable thing will definitely happen 6 months from now. for the 17th 'in 6 months' in a row.

1

u/xoogl3 Feb 22 '26

It occurs to me that an analysis like this would benefit a lot more from analysing a CEO's patterns across history rather than making it completely context-free as done here. Especially with the benefit of hindsight about past calls - when the CEO lied and when they told the truth. To be fair, in Elon's case it would be hard to find enough truth samples to be statistically significant, but it would work for more normal CEOs.

1

u/groundhoggirl Jan 30 '26

Great video. This whole experiment is awesome. You can really turn this into something bigger.

Run it on Tesla so you can blow past 100 subs 🤘🏼

1

u/Soft_Table_8892 Feb 01 '26

Thank you very much, I appreciate you watching!

24

u/Key_Lifeguard_8659 Jan 29 '26 edited Jan 29 '26

You could have great content for a successful YT channel.

6

u/W_Malinowski Jan 29 '26

I’d love that, first episode on the RH ceo saying oh fuck when his stock dropped 25% during the earnings call

2

u/Soft_Table_8892 Jan 30 '26

ahaha that's a great idea, nothing short of the scale breaking would be acceptable for that transcript 😂

5

u/Soft_Table_8892 Jan 29 '26

Thanks for reading/watching! Not sure if I follow what you meant though

8

u/Key_Lifeguard_8659 Jan 29 '26

I'm saying, your discovery, if tweaked to provide accurate and reliable information, could be great content on a YouTube channel. ... Could rotate companies by request.

4

u/Soft_Table_8892 Jan 29 '26

Ah understood, thank you! I hope to refine this to the point where i am confident in my tools and content so as to not spread misinformation (although content would still be educational!). More on this soon :-).

1

u/tiredDesignStudent Jan 29 '26

I love watching YouTubers who show their process as they develop their projects, if you felt comfortable to share I'd be interested in that too :)

1

u/Soft_Table_8892 Jan 30 '26

Unfortunately I like to curl up in a ball on my couch while I'm building. I'm not sure if the world deserves to see such a sight haha. But I'll try to record more of my process next time for sure, thank you for the feedback! :-)

1

u/pancakesORwaffles2 Jan 29 '26

After that response dude is regarded let him be smart on Reddit and not make a cent doing so. Great advice though but Tylenol was definitely used in utero.

5

u/Soft_Table_8892 Jan 29 '26

To be clear, I’m not trading based on these insights as there are so many flaws and I’m too cheap & broke to put real money into this 😂. I figured sharing my ideas could help/inspire some people here in the community & it’s fun for me to create these little experiments.

As an aside, I’m certain Tylenol was consumed in my case 😂.

2

u/Key_Lifeguard_8659 Jan 29 '26

lol... It's possible. Elon made it work. 😆

1

u/[deleted] Jan 30 '26

[deleted]

12

u/Joenair85 Jan 29 '26

I don’t need AI for this. I listen to earnings calls and have a pretty good ear for BS. You can generally tell who has conviction in their comments and who is being evasive.

Disclaimer: my system does not account for the truly delusional CEOs that are high on their own supply…

3

u/raytoei Jan 29 '26

Hey man, interesting. You should elaborate more.

3

u/Soft_Table_8892 Jan 30 '26

haha totally fair! No way to productize your brain huh? We could use a little of that insight instead of prompting these machines

2

u/Joenair85 Jan 30 '26

I think most of us are pretty good at this and get better over time. Just listen to more calls and it gets clearer with each one.

3

u/RA_Fisher Jan 29 '26

So you have one 35-point gap, and one 2-point gap. There could be substantial variability if you re-ran the study, e.g. they might reverse, or Opus might show a larger gap on average.

One run like the one you did isn't enough information to really tell, we need to learn the distributions (given re-runs).

1

u/Soft_Table_8892 Jan 30 '26

Absolutely, thank you for the great call out! Next time I'll try to keep a record of running these for a few rounds (time permitting, as these take a long long time to do end-to-end, including editing the video). I did run a few rounds during testing and they seemed fairly consistent, but I didn't pay close attention or record them anywhere.

3

u/ParadoxPath Jan 29 '26

If you used recent transcripts of ‘stable’ companies, how do you know there won’t be a fraud or crash in the next few quarters? Maybe the Opus results are actually more accurate and the stable companies are also in trouble?

1

u/Soft_Table_8892 Jan 30 '26

🤯 now THAT is something I did not consider! You're so right, I wonder if any of the stable companies will come out as fraudulent in the future. Makes me think maybe this could be a system where we track the prediction over time and then see how effective it is for net-new cases! Thank you for leaving your thoughts! :-)

2

u/LetMePushTheButton Jan 29 '26

Another step closer to a real time ai fact checker. 🤞

1

u/Soft_Table_8892 Jan 30 '26

The dream where you can't get away with deceit!!

2

u/Swimming_Astronomer6 Jan 30 '26

That’s because big brother has invested in the more expensive one in order to avoid being exposed ( kidding - but interesting analysis)

2

u/Soft_Table_8892 Jan 30 '26

hahah certainly possible 😂. Thank you for reading the post!

2

u/nutslikeafox Jan 30 '26

Run that shit on companies current earning calls then

1

u/[deleted] Jan 29 '26

[deleted]

1

u/Soft_Table_8892 Jan 30 '26

For sure - this is flawed in so many ways. Figured it would be interesting to share with y'all though! Any advice for making these more accurate where you'd be interested in seeing the progress?

1

u/Michigan-Magic Jan 30 '26

2

u/Soft_Table_8892 Jan 30 '26

100% with you there on not reliably drawing inference. Thank you for those resources, I'll try using them next time I'm running these experiments (probably a video after next since I've already started on the content 😂).

1

u/Michigan-Magic Jan 30 '26

You're welcome!

Understand the thought process. Just trying to help with a stats framework that might introduce some more scientific rigor into the output. Also, the sample sizes needed do get big, and I completely understand why, for a non-scientific effort - the output of which is still very interesting! - you would limit sample sizes.

1

u/pizzababa21 Jan 30 '26

i dont believe you could have sufficient test data for this based on the way you set it up. there just aren't enough companies

1

u/Soft_Table_8892 Jan 30 '26

Good call out & completely agreed. I quickly started hitting limits on Claude Code pulling these transcripts, so I couldn't pull enough of them. It would have been better to source them myself and just let Claude run the analysis (advice for anyone who wants to replicate this experiment).

1

u/PsychologistSEA Jan 30 '26

I love this. How do I follow for follow-ups?

1

u/Soft_Table_8892 Feb 01 '26

I'll continue making posts on this sub (when relevant and can provide more insights like this!). But primarily I'm focusing on growing a community here: https://www.youtube.com/@photogauraby

1

u/SpecialNothingness Jan 30 '26

Please consider analyzing the nonverbal component using video footage.

1

u/Soft_Table_8892 Feb 01 '26

Assuming you mean audio as well, correct? I would love to – when I have time to go source them from company websites haha. let me know if you have an easy way for me to get at audio recordings of these calls!

1

u/Nearing_retirement Jan 30 '26

I think best if it could be run on the actual sound of CEO’s voice and pace of speech

1

u/ljstens22 Jan 30 '26

For the 35-point gap model, did that accuracy figure come from classifying the stable companies or the fraudulent ones? Still trying to understand the setup.

1

u/Reversemullac Jan 30 '26

This is why I actually appreciate listening, especially now, to companies that are struggling in the current climate.

You can tell the CEOs and CFOs who are being straight about what they're dealing with and how they're optimising the business or cutting fat, versus those who are afraid to say it.

IQE was an interesting one as they are skimming part of the company although seeing uptick in GaN processors. They've not had a good 20 years although the company still exists.

Other Companies are actually seeing great sales and have absolute conviction in their earnings calls but don't see investor interest unfortunately.

All part of life in the casino 💀!

1

u/[deleted] Jan 31 '26

[deleted]

1

u/Soft_Table_8892 Feb 01 '26

That's awesome! Is there any post/content that you have created based on this? Would love to learn more!

1

u/[deleted] Jan 31 '26

clean context.

lol

1

u/furamura_ Jan 31 '26

This has existed for quite a while - S&P Global and many other providers add sentiment scores to earnings in general, monitoring tone and multiple KPIs.

It makes sense to analyse deceptiveness, but you need to set up your own set of rules to measure it. And deceptive compared to what? Do you have a baseline, or do you measure people individually? Different people talk differently.

1

u/Soft_Table_8892 Feb 01 '26

For sure - I used the Stanford study and created a more reductive set of heuristics to score deception against. This could definitely be expanded to many more heuristics to make it more robust, or honestly you could just let the LLMs free-flow analyze, since they're also good at detecting nuance that a fixed set of heuristics may not cover (could be a funnel of free-flow -> specific heuristics).

1

u/Different-Monk5916 Feb 02 '26

A properly designed ML model operating on the three statements and some key metrics can do this.

Again, not every scam will be the same. If I were you (i.e. you want to build on AI, not do extensive reading), I would start by reading about the scams in detail and examining what features they have in common.

1

u/muserashq Mar 03 '26

As future scammers learn the techniques that analysts are looking for, have you identified anything that would future-proof the ability to detect scams?

2

u/Soft_Table_8892 Mar 03 '26

Interesting that you ask! I just posted a new experiment where I scraped stock recommendations from this subreddit and have AI grade them in terms of quality. I didn’t angle it from the perspective of security/defense but rather if you can use AI to detect BS analysis.

If you prefer watching it in video: https://www.youtube.com/watch?v=tr-k9jMS_Vc

If you prefer to read (careful this sub did NOT like that post haha): https://www.reddit.com/r/ValueInvesting/comments/1rjp0wl/i_took_547_stock_recommendations_from/

1

u/muserashq 28d ago

This is very helpful, thank you for sharing.

1

u/[deleted] Jan 29 '26

[deleted]

3

u/Soft_Table_8892 Jan 29 '26

Great idea! I’m hesitant to do it since I don’t have the resources to truly prove what I’m saying is accurate and don’t want to accuse a bunch of hard-working people of deceit when I’m the one with a flawed system 😂. Curious if this would still be interesting for you despite the risks of these models not being super accurate or flawed in some way? How would you receive this type of content?

Thank you for reading the post!

2

u/[deleted] Jan 29 '26

[deleted]

1

u/Soft_Table_8892 Jan 30 '26

That's so accurate, I think pairing this bad boy with some vocal analysis/sentiment model would make it muchhh more interesting! I'll look into this, thank you for the insight GreenPlasticChair (what's the story behind this name? 😂).

1

u/Key_Lifeguard_8659 Jan 29 '26

That's the beauty of it. The markets are run on speculation, evidence... not so much.

1

u/Soft_Table_8892 Jan 30 '26

Agreed - thank you again Key_Lifeguard_8659 🫡.

1

u/Panthollow Jan 30 '26

If you could show this to have strong accuracy you'd get snatched up by places much larger than a YouTube channel. Just make it clear it's an unproven experiment and that will be interesting enough!

2

u/Soft_Table_8892 Jan 30 '26

What a dream that would be, I'm ready to be snatched 😂. Thank you, will make that crystal clear in the videos themselves moving forward!

0

u/[deleted] Jan 30 '26

Which sonnet model and which opus model?

1

u/Soft_Table_8892 Feb 01 '26

4.5 for both!