Beyond the Curtain: Architecture, Memory, and the Illusion of Reasoning in LLMs

 


By Austin Staton

 IT Professional & Bellevue University Alumnus

Introduction: The Bedrock of Continuous Learning

It has been just over a year since I walked across the stage at Bellevue University, and in that time, the professional landscape has shifted beneath our feet. Working in IT and data center operations for over eight years has taught me one immutable truth: the moment you stop learning is the moment you become obsolete. As a species, our existence is built on a bedrock of continuous learning, drawing on the discoveries of those who came before us, trusting one another, and acknowledging that we will never truly "know everything."

With the rapid proliferation of Large Language Models (LLMs) like Gemini, Claude, and GPT, I felt a personal necessity to look "behind the curtain." You think you know how they work, but do you really?

I wanted to understand not just what these tools have been used for, but how they actually function at a structural level. What I discovered is a fascinating contradiction: a system that feels remarkably human but operates on principles entirely foreign to biological reasoning. 

The "Static Brain" vs. Biological Learning

One of the most striking realizations from my research and testing is that an LLM is not a "brain" in the biological sense. I never truly believed it was, but the jumps in what these models could do did start to make me question myself, haha.

When we learn, our neural pathways are constantly re-wiring; we build on previous knowledge in real time. In contrast, an LLM is a physical, static structure once its training is complete.

The model’s "intelligence" is stored in its weights, aka the billions of parameters baked in during initial training. I like to think of this as the model's ROM (Read-Only Memory): a frozen snapshot of data.

It cannot "learn" a new fact through a simple conversation, whether that fact is a software update, a new data center protocol, or what happened in the news today; nothing from a chat gets permanently stored in its weights. To truly "know" something new, the model must be retrained or fine-tuned, a process that is computationally massive and slow.
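The ROM analogy can be sketched in a few lines of Python. This is a toy illustration, not how a real LLM works: the point is just that the "weights" are fixed at training time, and a conversation only reads from them while the session memory grows separately.

```python
# Toy illustration of frozen weights: a conversation reads them, never writes.
# (Hypothetical names; a real model stores billions of numeric parameters.)
FROZEN_WEIGHTS = {"capital of france": "Paris"}  # baked in at "training" time

def chat(prompt: str, session_memory: list) -> str:
    session_memory.append(prompt)  # goes into the context window ("RAM")
    key = prompt.lower().strip("?")
    return FROZEN_WEIGHTS.get(key, "I don't know that.")  # read-only lookup

memory = []
print(chat("Capital of France?", memory))       # "Paris", from frozen weights
print(chat("Who won the game today?", memory))  # "I don't know that."
# FROZEN_WEIGHTS is unchanged; only session_memory grew.
```

New facts never land in `FROZEN_WEIGHTS` here; to add one, you would have to rebuild the dictionary itself, which is the retraining/fine-tuning step in this analogy.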

The Workspace: Tokens and the Context Window

To interact with us, the model uses what we call a Context Window. As an IT professional, I find it easiest to view this as System RAM. This is the model's working memory for a specific session. Everything you type, and everything it responds with, is converted into tokens (numerical representations of text).

The context window has a hard limit, like a cup that can only hold so much water. Once you hit that token cap, the model begins to lose context. It might "forget" a detail from ten prompts ago because that data has been "pushed out" of the active RAM. While developers are working tirelessly to expand these windows, they remain a temporary workspace, not a permanent storage solution.

Bridging the Memory Gap: RAG and Persistent History

This leads to a common point of confusion: if the model is static and the context window is temporary, why does Gemini or ChatGPT seem to remember who I am across different days? Magic? Haha.

No, the answer isn't "long-term memory"; it’s a clever architectural bridge called Retrieval-Augmented Generation (RAG).

RAG allows the AI to use agentic search to pull in fresh information from the web or from your own persistent chat history. When you start a new session, the model performs a quick background "fetch" of a user summary: key facts about you and your preferences from whatever you have discussed, if it can see them. It then "shoves" this data into the current context window. When your prompts need information from the web, it uses the same RAG pattern, calling other APIs to do shorter fetches.

It creates an illusion of memory, but structurally, it’s just a very efficient filing system. In my case, if I were to delete my persistent chats or disconnect the model from its APIs, that "familiarity" would vanish instantly. The model would revert to its "base" training set, having no idea who I am or what projects we’ve worked on.
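The "efficient filing system" can be sketched as retrieve-then-stuff. Everything here is hypothetical (the store contents, the function names), and real systems rank with vector embeddings rather than keyword overlap, but the shape is the same: fetch relevant notes, prepend them to the prompt, and the model suddenly "remembers."

```python
# Minimal RAG sketch: "memory" is just retrieval plus context stuffing.
# Hypothetical store; real systems use embedding similarity, not keywords.
PERSISTENT_HISTORY = [
    "User works in data center operations.",
    "User graduated from Bellevue University.",
    "User prefers Python examples.",
]

def retrieve(query: str, store: list, k: int = 2) -> list:
    # Rank stored notes by naive keyword overlap with the query.
    q_words = set(query.lower().split())
    ranked = sorted(
        store,
        key=lambda note: len(q_words & set(note.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Fetched notes get "shoved" into the context window ahead of the question.
    notes = retrieve(query, PERSISTENT_HISTORY)
    return "Background: " + " ".join(notes) + "\nQuestion: " + query

print(build_prompt("Tell me about the user's data center work"))
# Empty out PERSISTENT_HISTORY and the "familiarity" vanishes instantly.
```

Note that nothing about the model changed; only the text placed in front of the question did, which is exactly why deleting the store kills the illusion.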

All that being said, this takes cycles, lots of them, all driven by inference-time processing. That cost in turn will balloon fast if we don't figure out better ways to approach it.

The Reasoning Ceiling and Cognitive Debt

We all know AI is not right by default, nor does it really know anything. The more you know about it, the more questions you start to put forward, or at least you should, haha. We must be wary of the "expert tone" these models adopt. As highlighted in recent critiques (such as those from The Fat Cat Report), LLMs are ultimately prediction engines. They are predicting the next most likely word based on patterns encoded in the weights they were trained on, not reasoning through a problem from first principles.
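The "prediction engine" idea can be shown with a toy next-word model. This is a bigram counter, orders of magnitude simpler than a transformer, but it makes the point: the output is whichever continuation was statistically most common in the training text, with no reasoning involved.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: pick the statistically likeliest continuation.
corpus = "the server is down the server is up the network is down".split()

# Count which word follows which in the "training" text.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # "server" (seen twice vs "network" once)
print(predict_next("is"))   # "down" (seen twice vs "up" once)
```

Ask this model whether the server is *actually* down and it has no concept of the question; it only knows that "down" followed "is" more often than "up" did. Scale that up by billions of parameters and you get fluent text, but the mechanism is still pattern continuation.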

There is a growing concern regarding "cognitive debt." When we outsource our thinking to a machine that can only mimic reasoning, we risk losing our own ability to troubleshoot complex issues. For those of us in data centers, where a single broken logic chain can take down a network, this is a dangerous trade-off. Scaling these models with more data and more GPUs is hitting a point of diminishing returns; we cannot reach AGI (Artificial General Intelligence) simply by predicting the next word more accurately. Nor should that be the goal, at least in my opinion.

Conclusion: The Human Element in a Predicted World

LLMs, and all the other varied models out there, are powerful tools, I will not deny that, but they are not sentient assistants. They are sophisticated pattern matchers that rely on a complex stack of context, weights, and RAG to function. They give the perception of memory and understanding as more features are shoved in, but that perception is only as deep as the current context window goes.

The bigger a context window gets, the more chance you have of running out of tokens, and the higher the probability of hallucinations as the session goes on, leading us back to the problem of constantly having to start new chats to give the context window a refresh.

As I continue my journey through the IT field, I will keep using these models to optimize my work, yes, but with a hand on the trigger, so to speak. I will do so with the understanding that the true "intelligence" still resides in the human ability to question, to verify, and to learn. We must remain the architects of our own understanding, ensuring that while the AI predicts the next word, we are the ones defining the next breakthrough.

Otherwise we may give in to heuristics too much, which may lead us to a place that is not the best.



Sources Referenced:

  • Jeferson Chagas, Ph.D., "Key Limitations of LLMs — and How to Overcome Them" (Medium) https://share.google/cat9BbNSKA5dqEDlw

  • The Fat Cat Report, "LLMs Can't Reason And They Never Will" (YouTube) https://youtu.be/z2TH5ietCZQ?si=Mx-PECxXut-NqG-P

  • Databricks, "Long Context RAG Performance of LLMs" https://www.databricks.com/blog/long-context-rag-performance-llms

  • The Telegraph, "Using AI makes you stupid, researchers find" https://www.telegraph.co.uk/business/2025/06/17/using-ai-makes-you-stupid-researchers-find/

  • Google Cloud, "What is Retrieval-Augmented Generation (RAG)?" https://cloud.google.com/use-cases/retrieval-augmented-generation
