Incremental AGI

[Image: abstract exponential art]

“I think AI agentic workflows will drive massive AI progress this year — perhaps even more than the next generation of foundation models. This is an important trend, and I urge everyone who works in AI to pay attention to it.” – Andrew Ng 2024-03-21

In a recent interview, Sam Altman indicated that GPT-5, while a huge improvement over GPT-4, is still just an incremental step towards AGI. Despite a laughable amount of hype, software agents are not yet capable of completing general tasks reliably. So the key question is: where are agents lacking? Below are some of the key components that industry engineers and researchers are improving incrementally.

What is AGI?

For the purposes of this post, AGI (Artificial General Intelligence) refers to software agents that handle open-ended requests from users with real-time clarification, planning, and working memory. In practical terms, this entails using an LLM to process English inputs and make decisions while employing APIs and software functions (tools) to accomplish a task.

Reference Architectures

We can draw inspiration from these projects.


Custom GPTs

OpenAI’s custom GPTs are customized chat apps hosted on a web page. They come with built-in tools such as web search and can index documents that you upload. They can be personalized (without writing code) by providing instructions and additional knowledge, and by selecting capabilities such as web search, image creation, or data analysis.

Limitations: limited to 20 uploaded files, and builders can’t see user interactions. The effectiveness of a custom GPT is largely dependent on the user’s ability to clearly define and communicate the desired functions and scope of the AI.


AutoGPT

AutoGPT is an open-source AI agent that works autonomously to achieve a goal set by the user.

Key points about AutoGPT:

  • It is based on OpenAI’s GPT-4 and GPT-3.5 language models, making it one of the first applications to use GPT-4 for autonomous tasks.
  • The user provides AutoGPT with a high-level goal, and it will break that down into sub-tasks, using the internet and other tools to try to achieve the objective without further user input.
  • Some example use cases include automating software development, conducting market research, generating business plans, and creating content like blogs and podcasts.
  • AutoGPT has limitations, such as a tendency to make mistakes, lack of long-term memory, and high computational costs from constantly calling the GPT-4 API.

Limitations: high cost (lots of LLM calls); a limited set of functions; the inability to convert a chain of actions into a reusable function means you’re starting from scratch every time; it can get stuck in loops.

Open Interpreter

Open Interpreter is a command-line tool for writing and executing code on your local computer.

Its local execution environment, open-source nature, and comprehensive feature set position it as a valuable tool for a wide range of coding tasks, from educational purposes to advanced data analysis and automation. Key features like multi-language support, interactive programming sessions, data handling capabilities, and configuration flexibility make it a versatile tool for users with varying programming skills and interests. Overall, Open Interpreter aims to simplify the coding process, promote community contributions, and provide a user-friendly and powerful development experience.

Limitations: no GUI yet; it is designed to operate a GUI, but API tools are lacking; it generates Python only.


GPT-Engineer

GPT-Engineer is a customizable, community-driven tool that simplifies software development. It generates codebases from natural language prompts and engages in conversational interactions to clarify instructions. The tool supports customization and integration, and primarily targets Python but can be adapted to other languages. GPT-Engineer’s modular architecture enables easy extension and customization. The software creation process involves two phases: Requirements Refinement Facilitation and Software Build.

Limitations: users need to review the code it generates, and it is mostly a one-shot project generator.


Devin

Devin is a closed-source application developed by Cognition Labs, described as the world’s first fully autonomous AI software engineer. Devin can write and deploy complete source code, handle debugging and deployment, fix issues in major repositories, and train models based on existing algorithms. Devin operates through a chatbot-style interface where users describe the code they need, and the AI takes over to create a detailed plan to solve the problem. It utilizes developer tools, writes code, resolves issues, runs tests, and provides real-time progress reports for users to monitor. Users can also intervene if corrections are needed. Devin’s capabilities include learning unfamiliar technologies, deploying end-to-end applications, fine-tuning AI models, fixing bugs in codebases, contributing to open-source repositories, and even completing real jobs on platforms like Upwork.

Limitations: slow, and not actually available to users yet.


What are the key components of these architectures?

Chat interface

Large language models (LLMs) simplify the creation of chat interfaces by generating human-like responses. However, developers still need to craft “scenario” prompts that motivate the LLM to clarify user queries interactively. For example, a customer support agent might have a prompt like: “You are a helpful customer support representative. Engage with the user to understand their issue, ask clarifying questions if needed, and provide a clear solution or escalate to a human agent if necessary.”
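
As a minimal sketch of how such a scenario prompt might drive a clarification loop, the snippet below stubs out the model call (`call_llm` is a hypothetical stand-in for any chat-completion API, and the "SOLUTION:" convention is invented) so the control flow runs without network access:

```python
# A "scenario" prompt plus a clarification loop. `call_llm` is a stub
# standing in for a real chat-completion API call.

SCENARIO_PROMPT = (
    "You are a helpful customer support representative. Engage with the "
    "user to understand their issue, ask clarifying questions if needed, "
    "and provide a clear solution or escalate to a human agent."
)

def call_llm(messages):
    # Stub: answers once the user has supplied an error code, otherwise
    # asks a clarifying question, mimicking the scenario prompt's intent.
    last = messages[-1]["content"].lower()
    if "error" in last and "code" not in last:
        return "Which error code are you seeing?"
    return "SOLUTION: try restarting the client."

def support_turn(history, user_msg):
    """One chat turn: append the user message, query the model with the
    scenario prompt prepended, and report whether a solution was given."""
    history.append({"role": "user", "content": user_msg})
    reply = call_llm([{"role": "system", "content": SCENARIO_PROMPT}] + history)
    history.append({"role": "assistant", "content": reply})
    return reply, reply.startswith("SOLUTION:")
```

In a real agent the loop would continue until the model signals it has enough information to answer or escalate.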


Planning

LLMs excel at generating to-do lists, but the challenge lies in continuously re-generating and updating them as the agent progresses towards its goal. For instance, an event-planning agent might generate an initial to-do list, but as tasks are completed or new information arises, it must adapt and reprioritize the remaining steps to ensure successful event execution.
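
To make the re-planning loop concrete, here is a toy sketch: the `plan` function is a stub for an LLM planner, and the tasks and reprioritization rule are invented for illustration.

```python
# Continuous re-planning: after each completed step, the remaining
# to-do list is re-generated instead of being fixed up front.

def plan(goal, done):
    # Stub planner: a real agent would ask an LLM for this list,
    # conditioned on the goal and on what is already completed.
    all_steps = ["book venue", "invite guests", "order catering"]
    remaining = [s for s in all_steps if s not in done]
    # New information can reorder priorities; here catering jumps the
    # queue once the venue is booked (its kitchen rules are now known).
    if "book venue" in done:
        remaining.sort(key=lambda s: s != "order catering")
    return remaining

done = []
while (todo := plan("company party", done)):
    step = todo[0]
    done.append(step)  # pretend we executed the step
```

The loop terminates when `plan` returns an empty list, i.e. when the (re-generated) plan says the goal is reached.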

Tool selection 

Retrieval Augmented Generation (RAG) is commonly used for tool selection. It involves using vector search to find relevant functions and then having the LLM choose the most appropriate one to execute. For example, a data analysis agent might search for functions related to data cleaning, visualization, and statistical analysis, and then select the optimal combination of tools based on the specific dataset and analysis goals.
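
A minimal sketch of the retrieval step, using a toy bag-of-words similarity in place of a real embedding model (the tool names and descriptions are invented):

```python
# RAG-style tool selection: embed tool descriptions, retrieve the
# closest matches to the query, then let the LLM pick from the
# shortlist. Bag-of-words cosine stands in for real embeddings.

from collections import Counter
import math

TOOLS = {
    "clean_data": "drop nulls and fix column types in a dataframe",
    "plot_chart": "draw a bar or line chart from a dataframe",
    "run_anova":  "statistical analysis of variance between groups",
}

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def shortlist(query, k=2):
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda t: cosine(q, embed(TOOLS[t])), reverse=True)
    return ranked[:k]  # the LLM would choose from this shortlist
```

In practice the shortlist keeps the prompt small: only the top-k tool descriptions are shown to the model, not the whole library.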

Skill creation

Like planning, generating code with imperfect knowledge poses challenges. LLMs can assist in skill creation by suggesting code snippets or templates, but the agent must be able to refine and adapt the generated code to fit the specific context and requirements. For instance, a web scraping agent might use an LLM to generate a basic script, but it would need to modify the code to handle edge cases, pagination, or site-specific quirks.
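
One concrete form of skill creation is capturing a chain of actions as a named, parameterized function the agent can reuse (the missing piece noted for AutoGPT above). A toy sketch, with invented primitive actions:

```python
# Turning a one-off chain of actions into a reusable "skill": the
# agent records the steps it took, then registers them as a named,
# parameterized function it can call again later.

SKILLS = {}

def register_skill(name, steps):
    """Store a list of callables (each taking a params dict) as a skill."""
    def skill(**params):
        return [fn(params) for fn in steps]
    SKILLS[name] = skill
    return skill

# Hypothetical primitive actions the agent already performed once.
def fetch(params):
    return f"fetched {params['url']}"

def parse(params):
    return f"parsed rows from {params['url']}"

register_skill("scrape_table", [fetch, parse])
```

The next time a similar task appears, the agent can invoke `SKILLS["scrape_table"](url=...)` instead of re-deriving the action chain from scratch.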

Working memory

Given the limited context available to LLMs, it is crucial to prioritize and retain only relevant information. Irrelevant or incorrect information can mislead the LLM and hinder its performance. A conversational agent, for example, should focus on maintaining context related to the current user’s query, while discarding or deprioritizing information from previous interactions that are no longer relevant.
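
A toy sketch of this kind of pruning, scoring stored items against the current query by word overlap (a stand-in for a real relevance model) and keeping only what fits a small "context budget":

```python
# Working-memory pruning: rank memory items by relevance to the
# current query and keep only the top few within the budget.

def relevance(item, query):
    # Crude relevance signal: shared lowercase words.
    return len(set(item.lower().split()) & set(query.lower().split()))

def working_memory(memory, query, budget=2):
    ranked = sorted(memory, key=lambda m: relevance(m, query), reverse=True)
    return ranked[:budget]

memory = [
    "user prefers window seats on flights",
    "user asked about refund policy last month",
    "current trip: flight to Lisbon on May 3",
]
```

Here the stale refund-policy note would be dropped when the user asks about booking a flight, while trip details and seating preferences survive.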

Mechanism for self-improvement

The architecture’s ability to improve itself can vary. Some agents may rely on human feedback and updates to their skill library, while others employ machine learning techniques such as reinforcement learning, supervised learning, or unsupervised learning. For instance, a recommendation agent might use reinforcement learning to optimize its suggestions based on user feedback, or a language translation agent could utilize supervised learning to improve its accuracy by training on a large corpus of parallel texts.
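
As a toy illustration of the reinforcement-learning flavor of this, an epsilon-greedy recommender (names and reward scheme invented) that updates per-item value estimates from user feedback might look like:

```python
# Epsilon-greedy feedback loop: mostly exploit the best-known item,
# occasionally explore, and update value estimates from user rewards.

import random

class Recommender:
    def __init__(self, items, epsilon=0.1):
        self.values = {item: 0.0 for item in items}
        self.counts = {item: 0 for item in items}
        self.epsilon = epsilon

    def suggest(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.values))   # explore
        return max(self.values, key=self.values.get)  # exploit

    def feedback(self, item, reward):
        # Incremental mean update of the item's estimated value.
        self.counts[item] += 1
        self.values[item] += (reward - self.values[item]) / self.counts[item]
```

Over many interactions the estimates converge toward what users actually reward, which is the self-improvement loop in miniature.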


Reasoning

LLMs can be leveraged for various reasoning tasks, such as recognizing correct answers, deciding on appropriate actions, parsing unstructured data, and generating structured data. For example, a question-answering agent might use an LLM to understand the context of a question, identify relevant information from a knowledge base, and generate a coherent and accurate response.


Proactivity

Ideally, software agents should be proactive in suggesting tasks or actions to be taken. This requires the agent to continuously assess its environment, goals, and available resources to determine the most appropriate course of action. For instance, a personal assistant agent might proactively suggest scheduling a meeting based on the user’s calendar availability and the urgency of the matter at hand.

What’s Missing?

Why do these systems fall short today? It is more than “GPT is not good enough”; each of the components and processes above needs to improve as well.

  1. Attention: Even if today’s LLMs can recall specific items from a large prompt context, they can’t apply all of it in a single shot. You can’t give them a library of quality code and ask them to write more code utilizing that library. The models struggle to maintain focus on the most relevant information and apply it effectively to the task at hand. This limitation hinders their ability to learn from and replicate high-quality code consistently.
  2. Deeper world knowledge: It is difficult for agents to assimilate information from different sources and determine what is correct. Searching the web for more contextual information is error-prone. They struggle to reconcile conflicting information or identify the most reliable sources. This lack of deep understanding of the world limits their ability to make informed decisions and provide accurate responses.
  3. Iterative-generation: Current agents often generate responses in a single pass, without the ability to outline a response before making it. They cannot break down a complex task into smaller, manageable steps and iteratively refine their output. This limitation can lead to inconsistent or incomplete results, especially for tasks that require multiple steps or careful planning.
  4. World-inspection skills: Software agents often lack the ability to thoroughly examine and understand their environment. They may struggle to navigate complex file structures, interpret error messages, or identify relevant information in logs or documentation. This limited world-inspection capability hinders their ability to diagnose issues, find relevant information, and adapt to new situations.
  5. Depth of planning: The context window limitation and insufficient working memory (or more accurately, attention to working memory) prevent agents from making good decisions based on long-term goals. They may struggle to break down a complex task into a series of smaller, achievable steps and adapt their plan based on intermediate results. This lack of deep planning ability limits their effectiveness in handling multi-step tasks or projects that require long-term reasoning.
  6. Code-correctness: When generating code, agents can make mistakes with regard to APIs being used, input/output formats, and other syntax or semantic issues. They may struggle to understand the nuances of different programming languages or frameworks, leading to code that is not fully functional or efficient. This limitation highlights the need for improved code understanding and generation capabilities.
  7. Fine-tuning: Fine-tuning techniques are still probably under-utilized because architectures are changing rapidly. As new architectures emerge, it takes time to develop effective fine-tuning strategies that can optimize the performance of software agents for specific tasks or domains. This lag in fine-tuning adoption can limit the potential of software agents to excel in specialized areas.
  8. Background Processes: The agents above rely on prompts from users to inspire them to do work. They lack the ability to continuously monitor their environment, identify tasks that need to be done, and proactively initiate work without human intervention. For example, a truly autonomous software agent should be able to scan a codebase, identify potential improvements or bugs, and start working on them without being explicitly asked.

Important research areas

Where is it most valuable to focus? Where are the most gains to be had?

1. Reliability

Where possible, we need to turn LLM prompt-chains into deterministic building blocks (make them robust to real-world noise).

  • Fine-tuning techniques, such as domain adaptation and task-specific training.
  • Enforcing structured output, such as using templates or domain-specific languages.
  • Developing robust evaluation metrics and benchmarks.
  • Low-level support for evaluating success / failure and retrying (possibly other strategies).
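
One way to combine several of these ideas into a deterministic building block is to demand structured output, validate it, and retry on failure. A sketch, with `call_llm` stubbed (it fails once so the retry path is exercised):

```python
# Wrapping a flaky LLM call in validation and retries, so callers see
# either validated structured output or a clear failure.

import json

def call_llm(prompt, attempt):
    # Stub: returns chatty non-JSON first, then valid JSON, mimicking
    # a model that doesn't always follow formatting instructions.
    if attempt == 0:
        return "Sure! Here is your answer: 42"
    return json.dumps({"answer": 42})

def reliable_call(prompt, validate, retries=3):
    for attempt in range(retries):
        raw = call_llm(prompt, attempt)
        try:
            out = json.loads(raw)
            if validate(out):
                return out
        except json.JSONDecodeError:
            pass  # malformed output; fall through and retry
    raise RuntimeError("no valid structured output after retries")

result = reliable_call(
    'Respond only with JSON like {"answer": <int>}. What is 6 * 7?',
    validate=lambda out: isinstance(out.get("answer"), int),
)
```

The caller never sees raw model text, only validated structure or an explicit error, which is what makes the block composable.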

2. Agentness

  • Adding skills to interact with their environment, choose actions, and evaluate outcomes.
  • Better search (for context or expertise), summarize (to select the relevant bits), and decide.

3. Retrieval and ranking

  • Content-aware embedding choices.
  • Developing better ranking algorithms, such as using learning-to-rank approaches or incorporating user feedback.
  • Exploring techniques for dynamic retrieval and ranking based on the evolving context of the task.

4. Prompting

  • Developing a systematic approach to prompt engineering, such as using templates, prompt chaining, or meta-learning, can help streamline the process of finding the best prompts for a given task.
  • Incorporating operator feedback and iterative refinement can help improve the quality and effectiveness of prompts over time.
  • Exploring techniques for automatic prompt generation or adaptation based on the task context (it’s difficult to prompt an LLM to write code that contains another prompt).
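
A small sketch of template-based prompting (the template text is invented; `string.Template` is used because its `$name` placeholders avoid clashing with the literal braces that appear when prompts contain JSON or nested prompts):

```python
# Parameterized prompts: variants can be generated, versioned, and
# tested rather than hand-edited each time.

from string import Template

TOOL_PROMPT = Template(
    "You are a $role. Use only the tools listed below.\n"
    "Tools: $tools\n"
    "Task: $task\n"
    'Respond as JSON: {"tool": ..., "args": ...}'
)

def render(role, tools, task):
    return TOOL_PROMPT.substitute(role=role, tools=", ".join(tools), task=task)

prompt = render("data analyst", ["clean_data", "plot_chart"], "chart Q3 sales")
```

Because the template is data rather than inline code, it can be swapped or A/B tested without touching the agent logic around it.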

5. Reinforcement Learning or other feedback mechanisms

  • Incorporating reinforcement learning or other feedback mechanisms.
  • Integrating user feedback and preferences into the learning process.

6. Dynamic memory and conceptual representations

  • Improving the ability of agents to store, retrieve, and update relevant information dynamically.
  • Developing more advanced memory architectures, such as using attention mechanisms or hierarchical representations.
  • Exploring techniques for building conceptual representations, such as using knowledge graphs or ontologies.
  • Incorporating mechanisms for incremental learning and knowledge consolidation.

Since the release of GPT-4, we’ve converged on a set of interesting ideas and use cases for how to build agents. However, the current generation of models is not yet capable of handling broad tasks in a completely independent way.

By focusing research efforts on these key areas, we can expect significant advancements in the capabilities and performance of software agents. The solution that we’ve discovered is to combine the broad agent architecture with a narrow set of curated skills, and to focus on a very specific set of problems that are solvable with the LLMs we have access to today. This approach ensures that our agents can deliver value today but also be very well positioned to plug in any upcoming generations of LLMs that are likely to be able to handle increasingly complex tasks in a more independent manner.