AI Automation Survey
Li Wei
Title: AI Automation Research
Background
LLMs write code quickly, but when faced with long‑term planning or complex context they tend to “hallucinate, dumb down, or lose direction.”
Bug fixing has always been a headache: it is less systematic than product-requirement work, yet it consumes a lot of human effort. To build an automated bug-fixing agent, we cannot rely solely on the free-form generation of large language models (LLMs). Instead, code generation and repair must be embedded in a hard workflow with strong rules, deterministic validation, sandboxed execution, and strict state-machine transitions. This constrains LLM hallucinations and makes each fix verifiable.
Compound AI Systems: the core theoretical paradigm for current agent development.
- Core idea: Solving complex engineering problems cannot rely on “one huge model + one huge prompt (Omni‑Bot mode).” We need to construct a compound system.
- System components: The system should include: models (LLMs), external tools (compilers, linters, search), retrievers (RAG for browsing codebases), and control flow (state machines, if‑else rule interceptors).
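The division of labor above can be sketched in a few lines. This is only an illustration under assumptions: `Patch`, `intercept`, `run_step`, and the protected-path list are hypothetical names, and the model and linter are passed in as plain callables rather than real LLM or tool calls.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Patch:
    path: str
    diff: str


# Hypothetical rule interceptor: plain if-else checks that run before any tool.
def intercept(patch: Patch, protected: tuple = ("migrations/", ".github/")) -> str | None:
    if not patch.diff.strip():
        return "empty patch"
    if patch.path.startswith(protected):
        return f"protected path: {patch.path}"
    return None


def run_step(model: Callable[[str], Patch],
             linter: Callable[[Patch], bool],
             task: str) -> tuple:
    patch = model(task)            # model: proposes a change
    reason = intercept(patch)      # control flow: deterministic rules
    if reason is not None:
        return False, reason
    if not linter(patch):          # external tool: deterministic validation
        return False, "lint failed"
    return True, "accepted"
```

The point of the sketch is that the model's output never reaches the codebase directly; it always passes through rules and tools that behave the same way on every run.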
Demo
Ideal implementation:
An automated pipeline dedicated to bug handling. For example, when a bug is found in production or during development, it can be reported via Feishu (a Chinese enterprise chat app). The agent collects reports into a single list for unified management, essentially a to-do list. Downstream agents then pull unprocessed bugs, analyze feasibility, and verify each bug against the project. If the report is a false positive or cannot be fixed by a code change, the agent returns a response. If it can be fixed, the execution agent takes over as a development agent: it creates a branch, writes the code, and adds unit tests. A testing agent then performs a second pass, including deployment to a test environment for integration testing. Once integration passes, the operations agent handles the production rollout, while a sentinel agent oversees the entire workflow. If any agent hits an error or the process is halted, the task status in the list is updated and notifications are sent, giving global management and monitoring of the whole pipeline.
The automation pipeline should be pluggable, highly extensible, and adaptable to different projects.
Demo concept:
Use a “Ralph loop” to let Claude Code repeatedly process unhandled bugs. After a bug is taken, it enters a supervisor state, spawns child agents with appropriate skills to perform analysis, development, and testing actions until the bug is resolved.
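A rough sketch of the outer loop, assuming the bug list is a list of dicts with a `status` field. `ralph_loop` and `handle` are hypothetical names: `handle` stands in for one supervisor run (for example, one Claude Code session) and returns the bug's resulting status. The round budget is an added safeguard, not something the original concept specifies.

```python
from typing import Callable


def ralph_loop(bugs: list, handle: Callable[[dict], str], max_rounds: int = 10) -> int:
    """Keep picking the next unhandled bug until none remain or the budget runs out."""
    rounds = 0
    while rounds < max_rounds:
        pending = [b for b in bugs if b["status"] == "NEW"]
        if not pending:
            break
        bug = pending[0]
        bug["status"] = handle(bug)  # supervisor run; returns the new status
        rounds += 1
    return rounds


bugs = [{"id": 1, "status": "NEW"}, {"id": 2, "status": "DONE"}, {"id": 3, "status": "NEW"}]
rounds_used = ralph_loop(bugs, lambda bug: "DONE")
```

Because each iteration re-reads the list, a bug that a run leaves as `NEW` is simply picked up again on the next round, up to the budget.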
Issues
- Can this sub‑agent mode handle medium‑to‑large projects or changes that span multiple files? Would switching to a multi‑agent implementation become overly complex, and how can reliability be ensured?
- Need to investigate how the underlying Claude Code (cc) framework implements sub-agents.
- The pipeline emphasizes a one‑stop solution: how to configure different projects before processing, integrate Feishu and other apps to improve usability and lower the entry barrier, and after processing, how to hook into test environments for automatic integration testing.
- One possible flow: when a project reports a bug, submit a description and the related project, forming a table; the backend creates a group chat, streams real‑time progress, and finally produces a report containing the outcome, code branch, and test‑environment deployment URL.
- The current design does not consider the agent’s self‑learning ability. If any link in the chain fails, there should be a manual termination option with an explanation. Once an agent confirms success, a post‑mortem should be recorded, enabling the system to become smarter over time. Reference the Hermes agent for implementation ideas and integration considerations.
State Transitions
State enumeration:
| Phase | What’s done | Who does it |
|---|---|---|
| NEW | Just submitted, awaiting processing | Human |
| ANALYZING | Search code, determine bug validity | agent: bug-analyze |
| FIXING | Fix code on an isolated branch + accompanying tests, iterate until pytest passes | agent: bug-dev |
| TESTING | Independent code review + boundary testing, double‑check fix quality | agent: bug-test |
| DEPLOYING | Deploy the fix branch to a test environment | agent: bug-deploy |
| INTEGRATION | Run E2E / API integration tests, verify behavior in a real environment | agent: bug-integration |
| REPORTING | Summarize results of all stages (including intervention records), generate a fix report + branch info | Supervisor: bug-orchestrator |
| DONE | Final state: fix completed, report and branch ready for human review | — |
| REJECTED | Final state: determined not a bug | — |
| FAILED | A stage failed, attempt … | — |
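The table can be turned into a small guarded state machine. The stage names come from the table; the allowed transitions below are an assumption, since the transition diagram is not reproduced here. In particular, the TESTING to FIXING edge is a guess, and FAILED is given no outgoing edges because its follow-up behavior is truncated in the table.

```python
from enum import Enum

Stage = Enum("Stage", [
    "NEW", "ANALYZING", "FIXING", "TESTING", "DEPLOYING",
    "INTEGRATION", "REPORTING", "DONE", "REJECTED", "FAILED",
])
S = Stage

# Assumed transition map; adjust to match the real diagram.
TRANSITIONS = {
    S.NEW: {S.ANALYZING},
    S.ANALYZING: {S.FIXING, S.REJECTED, S.FAILED},
    S.FIXING: {S.TESTING, S.FAILED},
    S.TESTING: {S.DEPLOYING, S.FIXING, S.FAILED},
    S.DEPLOYING: {S.INTEGRATION, S.FAILED},
    S.INTEGRATION: {S.REPORTING, S.FAILED},
    S.REPORTING: {S.DONE, S.FAILED},
    S.DONE: set(),
    S.REJECTED: set(),
    S.FAILED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage, or raise if the move is not in the map."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Keeping the map as data means an agent can only request a move; whether the move happens is decided by a lookup, not by the model.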
State transition diagram:
(diagram omitted)
PAUSED Mechanism
Design principle:
Any state—including DONE and REJECTED—can be interrupted by a human. Humans cannot monitor the entire execution in real time, and post‑run audits may also uncover issues.
The essence of every interruption is the same: pause the process, record the issue, wait for a human decision, then resume or re-judge. We therefore use a single unified PAUSED status together with the paused_stage / error_origin_stage fields, rather than defining a separate interrupt state for each stage.
Purpose of PAUSED:
- Pause mechanism – stops the agent, awaiting human input.
- Quality‑improvement data entry – each manual intervention becomes an intervention record that drives continuous skill optimization.
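A minimal sketch of the unified pause, assuming a task record that carries the `paused_stage` and `error_origin_stage` fields named above. Everything else (`BugTask`, `pause`, `resume`, the `note` key, the shape of an intervention record) is hypothetical, since the proposed field tables are not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BugTask:
    bug_id: str
    status: str = "NEW"
    paused_stage: str | None = None
    error_origin_stage: str | None = None
    interventions: list = field(default_factory=list)


def pause(task: BugTask, note: str, error_origin_stage: str | None = None) -> None:
    """Stop the task wherever it is and log an intervention record."""
    task.paused_stage = task.status
    task.error_origin_stage = error_origin_stage or task.status
    task.status = "PAUSED"
    task.interventions.append({
        "paused_stage": task.paused_stage,
        "error_origin_stage": task.error_origin_stage,
        "note": note,
    })


def resume(task: BugTask, restart_from: str | None = None) -> None:
    """Continue from the paused stage, or from an earlier one chosen by a human."""
    if task.status != "PAUSED":
        raise ValueError("task is not paused")
    task.status = restart_from or task.paused_stage
    task.paused_stage = None
```

One status plus two fields covers both a mid-run stop and a post-run audit of a DONE task; the intervention list is what later feeds skill improvement.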
Self‑learning flow:
(illustration omitted)
PAUSED‑related fields (proposed)
(cannot be displayed outside Feishu docs)
intervention field description:
(cannot be displayed outside Feishu docs)
An intervention that occurs in the DONE stage, with error_origin_stage pointing to the FIXING phase, indicates that the bug‑dev fix was flawed and bug‑test review also missed it – a single record can expose problems in two skills simultaneously.
root_cause_category classification and improvement direction
(cannot be displayed outside Feishu docs)
Projects
GSD (Get Shit Done) – Core constraint: Context isolation & environment purity
GSD started as a wildly popular prompt-engineering framework on GitHub and has evolved into a powerful TypeScript-based agent orchestration system (supporting Claude Code, Cursor, Copilot, and dozens of other runtimes).
- Core pain point it solves: Context Rot (context degradation). When an agent repeatedly modifies code, reads errors, and modifies again within the same session, the token window grows, and the model’s performance drops sharply once the context occupies more than half of the window.
- Implementation mechanism (Hardness):
- Thorough context isolation: Large tasks are broken into Atomic Tasks. For each code‑modification task, GSD spins up a brand‑new, ultra‑clean 200 K token window that loads only the files absolutely relevant to that task.
- Wave‑based execution: Independent tasks run in parallel; dependent tasks are queued.
- Mandatory acceptance tests: Every atomic task must provide a verification command (e.g., `pytest`). If the test fails, the task cannot modify global state (`STATE.md`).
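Wave-based execution amounts to layering tasks by their dependencies. The sketch below is one way to compute the waves, not GSD's actual implementation: `plan_waves` is a hypothetical helper, and it assumes every dependency also appears as a key in the input.

```python
def plan_waves(deps: dict) -> list:
    """Group tasks into waves: each wave only depends on earlier waves."""
    remaining = {task: set(d) for task, d in deps.items()}
    waves = []
    while remaining:
        # Tasks with no unmet dependencies can run in parallel.
        ready = sorted(task for task, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append(ready)
        for task in ready:
            del remaining[task]
        for d in remaining.values():
            d.difference_update(ready)
    return waves


waves = plan_waves({
    "parse": set(),
    "lint": set(),
    "fix": {"parse", "lint"},
    "verify": {"fix"},
})
```

Here `parse` and `lint` land in the first wave and can run side by side, while `fix` and `verify` each wait for the wave before them.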
GStack – Core constraint: Role awareness & decision perspective
GStack is an open‑source framework led by Y Combinator (YC) president Garry Tan (extremely popular in 2026). Its audience is developers who want to turn themselves into an entire R&D team.
- Core pain point it solves: Cognitive Mush (role confusion). If the same model makes architecture decisions and writes low‑level code, the output tends to be mediocre or off‑topic.
- Implementation mechanism:
- Cognitive Gears: Enforce strict directory structures (`.claude/skills`) and `CLAUDE.md` configurations to delineate roles.
- Provides a set of slash-command streams: e.g., thinking with `/office-hours`, planning with `/plan`, reviewing with `/review`, releasing with `/ship`.
- When the agent is in “Staff Engineer” mode, hard constraints are added: no arbitrary refactoring, must follow a specific design pattern. In “Review” mode, multiple adversarial sub-agents are spawned to check correctness, maintainability, and security.
Multica.ai – Core constraint: Enterprise‑grade state management & lifecycle
This is more of a commercial/open‑source platform solution, focusing on deploying agents in real‑world enterprise settings.
- Core pain point it solves: Agents tend to drop off, enter infinite loops, or fail to integrate with existing corporate permissions and approval flows.
- Implementation mechanism:
- Long‑running autonomous state machine: Agents can run for hours uninterrupted, with all progress (State) persisted so that a network loss can be recovered.
- Robust governance: Define clear permission boundaries (e.g., read‑only Jira access, can only open PRs, never merge to main) and authentication.
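Persisted progress can be illustrated with a checkpoint file that is rewritten after every completed stage. This is a generic sketch, not Multica's mechanism: the function names, the JSON layout, and the `done` key are all assumptions.

```python
import json
import os
from pathlib import Path
from typing import Callable


def save_checkpoint(path: Path, state: dict) -> None:
    # Write to a temp file, then swap it in, so a crash never leaves a half-written file.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> dict:
    return json.loads(path.read_text()) if path.exists() else {"done": []}


def run_stages(path: Path, stages: list, work: Callable[[str], None]) -> dict:
    """Run each stage once; stages recorded in the checkpoint are skipped."""
    state = load_checkpoint(path)
    for stage in stages:
        if stage in state["done"]:
            continue
        work(stage)                  # may raise, e.g. on network loss
        state["done"].append(stage)
        save_checkpoint(path, state)
    return state
```

If `work` raises partway through, calling `run_stages` again with the same path resumes at the first stage that was not recorded as done.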
Superpowers (another major Claude Code framework)
- Feature: Enforced TDD (test‑driven development) pipeline. The agent must write a test before any business code. This maps perfectly to bug‑fix scenarios: reproduce bug → write failing test → fix → test passes.
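The red-then-green rule can be expressed as a gate that rejects a fix unless the test first fails on the buggy code and then passes on the fixed code. The sketch is an illustration only: `tdd_gate` and the toy `is_even` functions are hypothetical and not part of Superpowers.

```python
from typing import Callable


def tdd_gate(test: Callable[[Callable], bool], buggy: Callable, fixed: Callable) -> str:
    """Accept a fix only if the test is red before it and green after it."""
    if test(buggy):
        return "rejected: test does not reproduce the bug"
    if not test(fixed):
        return "rejected: fix does not make the test pass"
    return "accepted"


def buggy_is_even(n: int) -> bool:
    return n % 2 == 1          # the bug: the condition is inverted


def fixed_is_even(n: int) -> bool:
    return n % 2 == 0


def reproduces_bug(is_even: Callable[[int], bool]) -> bool:
    return is_even(4) is True and is_even(3) is False


verdict = tdd_gate(reproduces_bug, buggy_is_even, fixed_is_even)
```

The first check matters as much as the second: a test that already passes on the buggy code proves nothing about the fix.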
Others
SWE‑agent (Princeton University open source)
- Feature: Designed specifically for fixing GitHub Issues (i.e., bugs). It includes a dedicated toolset called the Agent-Computer Interface (ACI). For example, it prevents the LLM from arbitrarily reading files by providing a paginated file reader, and blocks unrestricted writes with a safeguarded `search_and_replace` function.
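To make the idea of a constrained tool interface concrete, here is a sketch of the two tools described above. These are not SWE-agent's real functions or signatures: `read_page`, its `page_size` parameter, and the exactly-one-match rule in `search_and_replace` are assumptions chosen for illustration.

```python
def read_page(text: str, page: int, page_size: int = 100) -> str:
    """Return one fixed-size window of lines instead of the whole file."""
    lines = text.splitlines()
    start = page * page_size
    return "\n".join(lines[start:start + page_size])


def search_and_replace(text: str, search: str, replace: str) -> str:
    """Apply an edit only when the search string matches exactly once."""
    if not search:
        raise ValueError("search string must not be empty")
    count = text.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return text.replace(search, replace)


source = "a = 1\nb = 2\nc = 3\nd = 4\ne = 5"
second_page = read_page(source, page=1, page_size=2)
patched = search_and_replace(source, "c = 3", "c = 30")
```

Both tools narrow what the model can do in one call: it sees a bounded slice of a file, and an ambiguous or missing match becomes an error instead of a silent wide edit.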
AutoCodeRover (National University of Singapore)
- Feature: Strong reliance on the program’s abstract syntax tree (AST). It combines semantic search with class/method structure analysis to pinpoint bug locations, greatly boosting success in the “diagnosis” stage.
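Structure-aware lookup of this kind can be sketched with Python's standard `ast` module. This is a simplified illustration, not AutoCodeRover's code: `find_methods` is a hypothetical helper that only matches methods defined directly in a class body.

```python
import ast


def find_methods(source: str, name: str) -> list:
    """Return (Class.method, line number) for every class method with this name."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)) and item.name == name:
                    hits.append((f"{node.name}.{item.name}", item.lineno))
    return sorted(hits, key=lambda hit: hit[1])


code = (
    "class Cart:\n"
    "    def total(self):\n"
    "        return 0\n"
    "\n"
    "class Order:\n"
    "    def total(self):\n"
    "        return 1\n"
)
locations = find_methods(code, "total")
```

Unlike a text search for `total`, this returns only definitions, each tied to its enclosing class, which is the kind of precise location a diagnosis step needs.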
Sago (open‑source agent orchestrator)
- Feature: An agent-agnostic CLI orchestrator. Its philosophy: use a cheap model (e.g., Qwen or GPT-4o-mini) for planning, then hand the concrete execution plan to a powerful model (Claude 3.5 Sonnet / 3.7). Each step includes a mandatory verification gate, helping reduce API costs.
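The plan-cheap, execute-strong split with a gate after every step might look like the sketch below. It does not reflect Sago's actual interface: `run_plan` and its three callables are hypothetical stand-ins for the planner model, the executor model, and the verification command.

```python
from typing import Callable, Iterable


def run_plan(plan_cheap: Callable[[str], Iterable],
             execute_strong: Callable[[str], str],
             verify: Callable[[str, str], bool],
             goal: str) -> tuple:
    """Plan once with the cheap model, then execute step by step behind a gate."""
    done = []
    for step in plan_cheap(goal):
        result = execute_strong(step)
        if not verify(step, result):
            return done, step      # stop early: no further strong-model calls
        done.append(step)
    return done, None


completed, failed_at = run_plan(
    lambda goal: ["locate", "patch", "test"],
    lambda step: step.upper(),
    lambda step, result: step != "test",
    "fix the login bug",
)
```

Stopping at the first failed gate is where the saving comes from: the expensive model is never asked to build on a step that did not verify.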
Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.