I have a useful test for evaluating claims about LLMs’ ability to “understand”.
This example shows that, at the moment, even one of the smartest LLMs, GPT-4o, is truly “merely” a heap of statistical operations and not as “intelligent” as humans.
Here I asked [1] GPT-4o to merge the definitions I had extracted from 8 articles by Sean McClure (gist here: concat_transcript-ul-mcclure.txt).
My prompt evolved from:
Here are concatenated list of concepts and definitions from several essays, please merge into an exhaustive and comprehensive list
To:
Here are concatenated list of concepts and definitions from several essays, please process them all in lower case, then MERGE and sort the list alphabetically. Remember, MERGE both the concept and the definition so I have an exhaustive and comprehensive list
To being as explicit as I could:
Here are concatenated list of concepts and definitions from several essays. There will be overlaps and redundant concepts across the different sections, as well as the same concept labeled in different names / many slight variations of the name. Please FIRST convert them all into lower case, MERGE them, and then sort the list alphabetically, and turn the concepts back into their original cases. Remember, MERGE both the concept and the definition so I have an exhaustive and comprehensive list
This is the final result: ul-mcclure-prompt-v3-gpt-4o.md
Do you see what’s wrong with it?
It is still not able to conceptually “merge” items like Invariance, High-Level Thinking, Nature’s Deadline, Occam’s Razor, Structural Complexity, Structure, and Staggering. It doesn’t “know” or “understand” that these are the same concepts.
I thought that asking it to do something hackish, like first converting the text to lower case, would force the entries into closer low-level representations and help it recognise the same concepts.
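For comparison, here is a minimal sketch (mine, not part of the original experiment) of what that lowercase hack amounts to when done deterministically. Exact-match merging after normalization is purely mechanical; the one step it cannot do is decide that differently worded labels name the same concept, which is precisely what I was asking GPT-4o to do.

```python
# Minimal sketch of the "lowercase first, then merge" idea done deterministically.
# It only collapses entries whose labels are literally identical after normalization.
from collections import defaultdict

def merge_definitions(entries):
    """entries: list of (concept, definition) pairs collected from all essays."""
    merged = defaultdict(list)   # lowercased label -> list of definitions
    original_case = {}           # lowercased label -> first label seen, original casing
    for concept, definition in entries:
        key = concept.strip().lower()
        original_case.setdefault(key, concept.strip())
        if definition not in merged[key]:
            merged[key].append(definition)
    # Sort alphabetically by the normalized label, then restore the original casing.
    return [(original_case[key], " / ".join(defs))
            for key, defs in sorted(merged.items())]

# "structure" and "structural complexity" normalize to different keys, so they
# survive as separate entries; collapsing them requires a judgement about
# meaning, which is exactly the step the prompt asked GPT-4o to perform.
```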
And no, asking it to further process the final aggregated list (some manual two-shot prompting) doesn’t work either. I tried running the (still) duplicated output of the last prompt through this final prompt:
Here is a list of concepts and definitions. There will be overlaps and redundant concepts, i.e. the same concept labeled in different / slight variations of the name. Remember, MERGE both the concept and the definition so I have an exhaustive and comprehensive list
And here’s the output from that final, desperate attempt: ul-mcclure-prompt-v4-two_shots-gpt-4o.md.
Before encountering this case, when I tried to tackle questions such as “Does an LLM understand the words it processes? Can it reason? Is it intelligent? Does it matter as long as it appears intelligent?”, I would fall back on the relativity of definitions and reductive counter-questions such as “What is understanding? What is intelligence?”.
But now I have a concrete case that demonstrates a kind of understanding LLMs cannot (yet) achieve at this stage of their evolution, rather than resorting to the viral doxa of “I heard that such-and-such LLM failed to solve some basic (cherry-picked) logic riddles”.
And unlike the riddles, this use case is close to how knowledge workers actually use LLMs in practice: the difference between a technical interview and the actual job.
Ultimately, I had been trying to answer this question: to what degree can I trust the current state of LLMs, and for which cognitive activities can I use them as an extension? At this moment, I conclude: I’d better use them to throw interesting combinations of words and ideas at me, but I still need to be the alchemist.
Note: all the definition aggregation was done in one chunk, so the repeated concepts weren’t caused by context-window limitations. That’s also why it was able to sort the items alphabetically, something I recall earlier LLMs could not do reliably.
I haven’t tried the supposedly SoTA Claude 3.5 Sonnet, though; anyone curious to try? You can use nootroscripts to make the process easier.
I used my LLM scripts to generate the outputs here.
Originally published to Proses.ID.