
AI Multi-Agent Development: What I Learned Running a 32-Person Team Solo

A solo developer built a 32-agent AI team combining 39 characters and 21 roles for autonomous parallel development. Lessons on waterfall + V-model, polling-based state control, and facilitation design — translating real-world org management into code.

Lab
kkm-horikawa

kkm

Backend Engineer / AWS / Django

2026.04.04 · 12 min · 2 views

Real-World Org Design Works for AI, Too

Three dev teams process tasks in parallel. A content analysis team handles GA4 audits and content checks. All of this is running on my PC right now. The team has 32 members. Every single one is AI.

A few years ago, experiments with "giving AI a personality" and "splitting agents into teams" were everywhere. The results were almost always the same: entertaining but non-functional. Free-roaming agents quickly fell into conversation loops, produced contradictory output, and became unmanageable.

In 2026, the situation has changed. With improvements in LLM (Large Language Model) accuracy and the right architecture, AI can genuinely function as an organization. This article shares the lessons learned from building and operating a 32-agent AI development organization as a solo developer.

Why Build an "AI Organization"?

Multi-agent OSS projects like OpenClaw are fundamentally evolving toward making a single AI assistant more capable. Support for 24 messaging platforms, 560+ skill plugins, voice and vision integration. As a personal AI assistant, it is highly polished.

But what I wanted was not "one brilliant assistant." I wanted an AI that functions as a team. A pipeline where design flows into implementation, then testing, then review. Multiple tasks processed simultaneously. Humans focusing only on directional decisions. That kind of development organization.

No existing OSS delivered this. OpenClaw's architecture is "put everything on one gateway," and sub-agents are merely task delegates. Concepts like team coordination, round-based discussion, and phase-level quality gates are absent from its design.

So I decided to build it myself.

Translating Real Org Structure into Code

The guiding principle was that real-world organizational structures have been optimized over decades. Team composition, meeting protocols, approval flows. These are not just conventions; they are mechanisms refined to ensure quality when multiple people collaborate.

Honestly, I built this by recalling good and bad workplaces. What was working at the well-run teams I had been part of as a systems engineer? What was missing at the ones that fell apart? I translated those experiences directly into code.

Orthogonal Character x Role Design

Each agent is defined along two axes: "character (who they are)" and "role (what they do)." By combining 39 character files with 21 role files, the same "Elon" character can be assigned as a leader, developer, or tester.

# agent.py — Combine character and role into a system prompt
# (config_dir and tools_guide are defined elsewhere; excerpt trimmed)
def load_system_prompt(self) -> None:
    character_path = config_dir / "characters" / f"{self.character}.md"
    role_path = config_dir / "roles" / self.role_file

    # Concatenate personality (character.md) and job description (role.md)
    content = character_path.read_text() + "\n\n" + role_path.read_text()

    # Expand template variables (tools guide, owner info, etc.)
    if "{{TOOLS_GUIDE}}" in content:
        content = content.replace("{{TOOLS_GUIDE}}", tools_guide.read_text())
    self.system_prompt = content

The effect of this design is significant. Character Markdown defines speaking style and thinking habits. Role Markdown defines job procedures, such as "how to evaluate pass/fail as a reviewer." Since personality and expertise are separated, reconfiguring a team takes a single line change in YAML.

The realization that "playful" character settings are actually foundational to team operations came only after production use. Agents with clear characters produce more consistent output. Their tone and decision criteria stay stable, which lets other agents predict their behavior. Just like a human team: knowing each other's personality makes collaboration smoother.
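The mechanics above reduce to plain file concatenation. The helper and directory layout below are a hypothetical sketch of that idea, not the project's actual API:

```python
from pathlib import Path

# Hypothetical sketch of the character x role combination described above.
# The directory layout and helper name are assumptions, not the project's API.
def make_system_prompt(config_dir: Path, character: str, role: str) -> str:
    """Concatenate a personality file (character) and a job file (role)."""
    character_md = (config_dir / "characters" / f"{character}.md").read_text()
    role_md = (config_dir / "roles" / f"{role}.md").read_text()
    return character_md + "\n\n" + role_md
```

With 39 character files and 21 role files, any of the 39 x 21 combinations is one call away; reassigning "Elon" from leader to tester changes only the `role` argument.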

The 32-Agent Team Structure

Here is the overall structure.

| Team | Leader | Members | Scope |
| --- | --- | --- | --- |
| Dev Room 1 | Elon | Dario, Linus, Jeff, Steve | Design, implement, test, review |
| Dev Room 2 | Jensen | Gates, Guido, LeCun, Hejlsberg | Design, implement, test, review |
| Dev Room 3 | Altman | Berners-Lee, Eich, Hinton, Ellison | Design, implement, test, review |
| Discussion | - | 10 members | Technical discussions |
| Content Analysis | - | 6 members | GA4 analysis, content auditing, etc. |

Each dev team has a leader (design and direction), developers (implementation), a tester (verification), and a reviewer (quality judgment). It is not one container per team but one container per agent, running as physically independent processes. Three teams genuinely process three different tasks simultaneously.

When a task is submitted to the reception channel, it is automatically routed to an available team. The routing logic is deliberately simple.

# registry.py — Find an available team and route the task
def find_available_team(self) -> DevChannel | None:
    """Return the first DevChannel without an active session (overflow routing)."""
    for ch in self.dev_channels():
        session = ch.read_session()
        if not session.get("active"):
            return ch
    return None

I could have asked the LLM to decide which team is best suited for each task. I chose not to. The reason is explained later.

Waterfall Was the Right Answer

This was genuinely surprising.

Until AI entered the picture, agile was the dominant methodology, and the prevailing consensus was that it was the best approach. I found it comfortable to work in, too. So I initially assumed an agile-like process for AI teams.

But when you hand AI a goal and say "figure it out," what comes back is unmaintainable, has feature gaps, and is riddled with security holes. AI does what it is told and also does things it was not told. But that "not told" part goes in a different direction every time. Without a clear vision for each phase, output quality is unstable.

So I switched entirely to waterfall with a V-model. Requirements, high-level design, detailed design, implementation, and testing for each phase. At the end of every phase, a human review gate, analogous to a customer sign-off in traditional development.

# dev.py — Mechanically extract verdict from review text. No LLM needed
@staticmethod
def _extract_verdict(text: str) -> str:
    if "[却下]" in text:    # "Rejected"
        return "rejected"
    if "[合格]" in text:    # "Approved"
        return "approved"
    return ""

The reviewer's output is checked mechanically for [合格] (approved) or [却下] (rejected) using simple string matching. No asking the LLM "do you think this review passed?" If approved, the phase advances. If rejected, the same phase restarts.
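Put together with the V-model phases, the mechanical verdict becomes a retry loop over phase gates. The sketch below is reconstructed from the description above; `run_phase` and the retry cap are my assumptions, not the project's actual control flow:

```python
# Reconstructed sketch of the V-model phase gates described above.
# `run_phase` is a hypothetical callable that runs one phase and returns
# the reviewer's text; the retry cap is my addition, not the project's.
PHASES = ["requirements", "high_level_design", "detailed_design",
          "implementation", "testing"]

def extract_verdict(text: str) -> str:
    # Mechanical string match, as in dev.py above (no LLM involved)
    if "[却下]" in text:    # "Rejected"
        return "rejected"
    if "[合格]" in text:    # "Approved"
        return "approved"
    return ""

def run_pipeline(run_phase, max_retries: int = 3) -> bool:
    """Advance gate by gate: approved moves on, rejected restarts the phase."""
    for phase in PHASES:
        for _attempt in range(max_retries):
            if extract_verdict(run_phase(phase)) == "approved":
                break               # gate passed: next phase
            # rejected (or no verdict): the same phase runs again
        else:
            return False            # gate never passed within the retry cap
    return True
```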

Deliverables that pass the internal team review are automatically submitted to an external review channel. There, a human issues /approve or /reject for the final call. Feedback is automatically injected into the development team's context in the next round.

This made complete quality control possible. AI handles what it should; humans intervene where they must. Waterfall's "phase gates" proved far more effective than agile sprint reviews for AI team development.

What to Delegate to the LLM vs. Control Mechanically

As development progressed, the scope of what the LLM handles shrank considerably. This is probably the most important lesson for stable operation.

From Event-Driven to Polling

Initially, the system was event-driven: a message arrives, the LLM fires immediately to decide the next action. Intuitive, but unstable. Timing races, state inconsistencies, unpredictable chain reactions. Debugging was brutal.

Midway through, I switched all agents to polling with status-based activation. Every 5 seconds, the monitor checks all agents' status files. When everyone is IDLE, the leader's LLM is invoked to generate the next set of instructions.

# monitor.py — 5-second polling loop to monitor agent states
async def start(self) -> None:
    while True:
        try:
            await self._tick()      # Check all states, decide next round
        except Exception as e:
            logger.error("SessionMonitor error: %s", e)
        await asyncio.sleep(5.0)    # Wait 5 seconds

State is controlled mechanically. The LLM activates based on that state. Getting this order right made everything dramatically more stable. When the LLM manages its own status, even the judgment "am I currently working or idle?" becomes unreliable. State transitions follow four stages: IDLE → ASSIGNED → WORKING → AWAITING_APPROVAL, and transition rules are program-defined. The LLM's only job is to think about what to do when its state changes from IDLE to ASSIGNED.
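The four-stage transition rules could be encoded as a small table like this. The table and the `advance` helper are my illustration; the article specifies only the states and their order:

```python
# Sketch of the mechanical state machine described above. The four states
# come from the article; the transition table itself is my assumption.
TRANSITIONS: dict[str, set[str]] = {
    "IDLE": {"ASSIGNED"},
    "ASSIGNED": {"WORKING"},
    "WORKING": {"AWAITING_APPROVAL"},
    "AWAITING_APPROVAL": {"IDLE"},  # approved or rejected: back to idle
}

def advance(current: str, nxt: str) -> str:
    """Apply a transition, rejecting anything outside the table."""
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```

Because the transitions live in code, an agent can never claim to be WORKING without first having been ASSIGNED, regardless of what its LLM outputs.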

Routing Was Also Taken Away from the LLM

Task routing is the same story. Asking the LLM to judge "which team is best for this task" produces a different answer with different reasoning every time. No consistency. Switching to a simple FIFO (first-available-team) logic solved it without issue.

Review verdicts, mention resolution, session state management. As the implementation matured, I kept moving these "looks like a judgment call but is actually rule-based" processes from the LLM to deterministic code. Where the LLM truly shines is task decomposition, design decisions, code implementation, and review comment generation. In other words: thinking about problems that have no single right answer.
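Mention resolution is a good example of a "looks like a judgment call, actually rule-based" task. A minimal sketch with a plain-name pattern, illustrative only, since the real system presumably resolves Discord mention IDs:

```python
import re

# Illustrative sketch of rule-based mention resolution. The real system
# presumably resolves Discord mention IDs; this plain-name pattern is
# only meant to show that no LLM call is needed for the task.
MENTION = re.compile(r"@([A-Za-z][\w-]*)")

def assigned_speakers(message: str) -> list[str]:
    """Return mentioned names in order of first appearance, deduplicated."""
    seen: list[str] = []
    for name in MENTION.findall(message):
        if name not in seen:
            seen.append(name)
    return seen
```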

Facilitation Tames the "Free-for-All"

The most common failure mode in multi-agent "let's try it" experiments is agents acting freely and everything spiraling out of control. My early attempts were no different.

The fix was introducing leader and facilitator positions that coordinate via @mentions: summarize the current state, set direction, and delegate to the next speaker. Once this was in place, it just worked.

# facilitation_monitor.py — Round-based facilitation control
# (round 1 runs before this loop and initializes `assigned` and `complete`)
for round_num in range(2, self.max_rounds + 1):
    if assigned:
        await self._wait_for_members(assigned)   # Wait for named members to speak
    if complete:
        break
    response, assigned = await self._next_round(round_num)
    # Facilitator declares [DISCUSSION_COMPLETE] to end
    if "[discussion_complete]" in response.lower() and not assigned:
        complete = True

Each round, the facilitator designates speakers via @mentions. Once everyone has spoken, the next round begins. When the facilitator judges the discussion sufficient, they declare [DISCUSSION_COMPLETE] and move to a summary.

Once again, a direct translation of how real meetings work. A chair manages speaking order, redirects tangents, and moves to the next agenda when consensus is reached. Meeting protocols refined over centuries work just as well for AI.

OpenClaw and ai-team: Different Design Philosophies

OpenClaw is a massive project with 347K stars, and its polish as a personal AI assistant is impressive. What follows is not a ranking but a comparison of design philosophies: each system is built around a different core problem.

| Design Decision | OpenClaw | ai-team |
| --- | --- | --- |
| Core Problem | Make one AI usable across every platform | Make multiple AIs function as a team |
| Agent Count | 1 (can spawn sub-agents) | 32 (fixed roles) |
| Execution Model | Single daemon process | 32 containers in parallel |
| UI | Custom SPA (Vite+Lit) + 24 messaging channels | Discord (existing UI, no custom build) |
| Agent Definition | SOUL.md (monolithic persona) | character.md x role.md (orthogonal) |
| Task Management | FIFO queue (one session at a time) | 3-team parallel + waterfall |
| Quality Gates | None | Reviewer verdict + human review |
| Extension | Skill store (560+ plugins) | plugin.yaml (add teams) |
| Core Size | TypeScript, tens of thousands of lines | Python, ~3,600 lines |

OpenClaw is a Swiss Army knife; ai-team is a specialized toolset. OpenClaw's 24-platform support is impressive, but the problem ai-team aims to solve is not "where to receive instructions" but "how multiple AIs coordinate to produce a single deliverable."

Using Discord as the UI was a deliberate choice. OpenClaw builds its own Web UI as a multi-thousand-line SPA. ai-team uses Discord as-is. Chat, file sharing, threads, reactions, search, permissions. There was no reason to rebuild what Discord provides for free. For a solo developer, "not building it" is the strongest design decision you can make.

What I Haven't Tried Yet

Currently, the three dev teams have no particular specialization. But there is room to tune team personalities. A team that excels at refactoring. Another that rapidly produces agile-style prototypes. Since roles and characters are separated, these experiments require only YAML changes.

The ability to run autonomously on a cron schedule is already implemented, but I have paused always-on operation due to cost. Honestly, I have a feeling that letting it loose would be extraordinary. In both good and bad ways.

I am also considering uploading a demo video to YouTube. Watching 32 agents discuss on Discord, write code, review, and publish is probably the most effective way to convey what this is. When it is ready, I will add it to this article.

Defining Roles and Limiting Scope Works for AI, Too

The biggest surprise from using this system is that practices long taken for granted in human organizations apply directly to AI.

Define roles and limit the scope of work. Leaders set direction, developers implement, testers verify, reviewers judge quality. Meetings get a facilitator. Approval flows get gates. Waterfall divides phases, and the customer (human) reviews at each gate.

Every one of these is a decades-old, well-established organizational practice. And they all turned out to be just as true for AI.

By applying them one by one, I ended up as the sole CEO of a 32-person AI organization.

References

  • ai-team-template (GitHub, public release in progress, developed autonomously by the dev teams described in this article)
  • OpenClaw (GitHub, referenced for comparison)
  • Full source code is available at the GitHub repository