Pausing Limo
A postmortem on bootstrapping a DevOps startup in the age of AI
This is a sad but necessary post:
We’ve halted development and user testing of Limo. There are a number of reasons why it didn’t work out. I’ll dive into each one in this post, and touch on what’s next for the team.
Product Market Fit
We ran into a few classic product pitfalls. A lot of these stemmed from relying too much on my own experience and intuition. As a flimsy defense, there probably is at least some value in those. Limo was originally born to help automate repetitive, error-prone tasks like setting up and maintaining infrastructure in AWS – tasks I found myself doing over and over again.
In reality, the particular pain points of DevOps seem to vary greatly because of its broad scope and definition. For example, this is how GitLab defines DevOps:
DevOps is a combination of software development (dev) and operations (ops). It is defined as a software engineering methodology which aims to integrate the work of development teams and operations teams by facilitating a culture of collaboration and shared responsibility.
What is a development team? An operations team? What is the difference between dev and ops? Which one better describes the process of writing Terraform or other types of Infrastructure as Code?
The answers to these questions are not categorical. There’s no single standard, or even 2-3 competing ones, that can be applied easily to describe all DevOps practices. Naively, this lack of standardization seemed like the problem to solve. I’ve seen so much time wasted at nearly every job I’ve had dealing with bespoke CI/CD pipelines, infrastructure as code, ways of searching logs, and handling on-call duties.
Unfortunately, we should have looked for a problem to solve, tied to real potential customers or users. Instead, we built working prototypes covering way too much ground: infrastructure setup. Deployments. Connecting to AWS, Netlify, and GitHub (and connecting them to each other). We had user-authorable “routines” to easily automate repetitive tasks. An optimizer workflow to discover savings opportunities. A troubleshooting one to analyze and correlate log searches. A Slack plugin and full Web UI. An MCP integration.
It was a mess. Even with four full-time, relatively experienced developers, and AI tools to help accelerate development (or perhaps because of them), the features’ demands exceeded our ability to support them.
Generative AI opens new possibilities, but it doesn’t change the need to focus on solving one problem really, really well, instead of solving many problems poorly.
A smarter approach would have been to narrow in on a more particular problem by engaging with several potential customers early, identifying common pain points, and seeing what solving these pain points was worth to them. An outreach-driven approach would have given us a starting point to find one thing to focus on and excel at. We had a list of over 100 friends and former co-workers to start with, many of whom had already expressed interest in learning more.
We waited to talk to them, which I deeply regret. I’ve only recently realized that, irrationally, I thought having demos in hand was a prerequisite to being taken seriously or getting good feedback.
In hindsight, this is just not true. There are a decent number of investors who will hear an early pitch without a product existing at all. The bar for talking to former colleagues is even lower than that.
Regardless, by the time we did ramp up outreach and open up access, we had trouble getting more than a handful of users to try it! The feedback we got from developers was that they weren’t sure what to do with the product, or if it even solved a problem for them.
These combined factors meant we needed several more iterations to reach an MVP. With a few more months of runway, it’s possible we could have started tuning Limo toward serving a particular development community or industry. But our funding dried up before we were able to find out.
Lessons Learned
Solve a specific problem well, not a broad problem poorly
Outreach leads to problems worth solving
Don’t wait for a demo to be ready to engage
Funding Issues
If you ignore sweat equity, Limo was completely bootstrapped from the cashflow of Liminal Labs, a boutique consulting agency I started in 2023. Liminal Labs was/is niche and, realistically, not a huge cash cow. I don’t value our time very well, and we’re bad at marketing.
Anyway, to make Limo happen, we offboarded most of our clients. This crippled our revenue but freed up engineers.
Anthony, Liminal’s longtime employee, prepares a cashflow report we’ve reviewed weekly for the last two years. This meant we knew what we were getting into to some degree. We braced for impact, while simultaneously plotting a careful escape route toward a seed round.
Pretty much immediately, everyone faced reduced or no wages. We had to let go of our junior engineer (which I still feel terrible about). Those of us left were motivated and full of adrenaline. But the reality was sobering: We had started a countdown timer that couldn’t easily be stopped. We had to find some kind of revenue or funding by the end of August or it was over.
To us, that meant quickly making something we thought people would want to use, getting feedback knowing we’d be wrong, and iterating to close the gap. Using all these awesome AI-integrated tools like Devin and Windsurf would surely speed up development. Maybe along the way, we’d get lucky and meet some investors.
Indeed we did have some contact with a few investors. We also received deep, meaningful feedback on our pitch and demo from former friends and colleagues experienced in startups, venture capital, technical leadership, and more. These people were more than generous with their time. If you’re reading, thanks again for your help!
Everyone’s concerns were consistent and clear: competition from bigger players, no existing traction, no track record or founder with a previous exit, and no particular expertise in AI or accolades in DevOps.
Undermining things further was our own lack of confidence that what we had was ready for customers. It’s hard to market or pitch something you’re not confident in yet. I can be confident, but it has to be authentic. I’d have been lying if I endorsed Limo in the state it was in back then. Early on, Devin had created a lot of slop for us to clean up. Quality sucked. It wasn’t worth paying for.
Improvements had to be made before it was ready. To compensate, we worked tirelessly on features and bugfixes. We sharpened our pitch deck and demo. We applied to startup accelerators over and over, occasionally receiving a polite rejection back. We eventually started getting help from Mort, an actual, really good product owner, which re-energized things for a bit.
But there’s a cold and cruel reality to cash, and once we ran frighteningly low on it, we had to lean back into consulting. We attempted to rebuild a sales pipeline, got some small deals, and came close on some bigger ones that never panned out. It felt wrong, it was wrong, but I wasn’t sure what else to do. We couldn’t skip paychecks forever.
All these side quests ultimately ended up being counterproductive. I spent critical time away from Limo when we needed more thought put into features, quality, and business development.
We didn’t meet our dream investors quickly enough, nor did we generate enough sales to buoy things. In the midst of our first batch of user testing, we ran out of cash to continue working on Limo while keeping Liminal Labs solvent. Game over 👾
Lessons Learned
To get investment, you need relevance, credibility, or traction
AI development tools bring down costs less than anticipated
Don’t get distracted with side quests
Learning Curve
As a software engineer, I consider myself to be more of a generalist. I’ve built apps on the web, mobile, and desktop; on touchscreen kiosks, room-sized imaging equipment, and Kinect-powered basketball courts deployed to shopping malls in China. I’ve worked on C# monoliths and Python microservices. Maintained Kubernetes clusters and multi-tenant cloud platforms. My specialties tend to follow the technologies and domains of the projects I’m working on, over any specific languages and tools I prefer working with.
While some had rougher learning curves than others, in the end, I’ve been able to become proficient enough to succeed in most types of software development. Troubleshooting and reverse engineering in particular are useful muscles one can build up to solve many kinds of technical problems. This is because in traditional software, you can find explanations for why something works (or doesn’t work). When stuck, you can dig through code, ask someone, or recreate a conceptual model of what’s going on to analyze it further. It may be incredibly frustrating and time-consuming to track down the reasons things work the way they do. It’s frequently irrational, but it’s at least deterministic. It’s possible.
When building Limo, which ultimately relied on GPT-4, techniques like reverse engineering and troubleshooting proved fairly useless. There were sometimes no explanations to be found, no one to ask, no concepts to latch onto. Models are inherently unpredictable. Still, more often than not, we were able to get the model working reliably for a narrow use case, such as queries similar to “list my repos”, which listed GitHub repositories by making a tool call using Octokit. The hype is not completely baseless. We found LLMs do perform well under certain conditions.
The “certain conditions” were the problem, however. In our development process, we were never really sure what those conditions were, as they were a moving target. Consequently, we lacked insight into how well Limo worked at any given moment. The UI and backend remained relatively stable thanks to early investments in a lightweight CI/CD pipeline and low-hanging fruit like unit tests and linting.
But when we’d integrate and test each other’s features, instead of merge conflicts we encountered a new kind of collision we’d never considered, something better described as “context conflicts”.
Take the “list my repos” example from before – under the hood, it was tied to a tool called list_github_repositories. Now imagine that in addition to GitHub, we also added support for AWS’ Elastic Container Registry by exposing a new tool named list_aws_ecr_repositories. What repos should Limo list now: your ECR ones, or your GitHub ones? Or do you mean a different private Docker registry, like Azure Container Registry (ACR)? Or perhaps a different source control provider, like GitLab?
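To make the collision concrete, here’s a minimal sketch of how two OpenAI-style tool definitions can overlap semantically. The schemas and descriptions are illustrative, not Limo’s actual code:

```python
# Two OpenAI-style tool definitions with overlapping semantics.
# Illustrative only; not Limo's actual schemas.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "list_github_repositories",
            "description": "List the user's source code repositories on GitHub.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "list_aws_ecr_repositories",
            "description": "List the user's container image repositories in AWS ECR.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]

# Given "list my repos", the model has to choose between these with no
# signal beyond the descriptions and chat history. Both legitimately
# match the word "repos".
```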
Well, all of that kind of depends on context. Like most chat-based interfaces, Limo’s natural language interface mostly puts context control in the hands of end users. They are the ones driving the conversation topic, not agents. Users’ inputs are often less predictable than LLMs’ outputs.
We couldn’t anticipate everything a user might do in a system prompt, or figure out how to exclude context in some situations but not others. Or when context changes were necessary at all. Or understand the degree to which any given piece of context impacted results. Context management, we found, is both an art and a science, one that requires finer attention to detail and deeper domain understanding than traditional software features.
Prompt engineering partially mitigated context problems, as did techniques others have written about like tool loadouts. But we didn’t have a streamlined way of measuring how much any of these improvements actually helped, which massively slowed down progress.
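For the curious, here’s roughly what a tool loadout looks like in practice. This sketch is hypothetical – it routes on crude keywords where a production system might use embeddings or a cheap classifier – but it shows the shape of the technique: select a relevant subset of tools before the model ever sees them.

```python
def select_loadout(user_message: str, all_tools: list[dict]) -> list[dict]:
    """Expose only the tools relevant to the current turn.

    Hypothetical sketch: real implementations might route with embeddings
    or a small classifier model instead of keyword matching.
    """
    message = user_message.lower()
    if any(word in message for word in ("container", "image", "ecr", "docker")):
        wanted = {"list_aws_ecr_repositories"}
    elif any(word in message for word in ("github", "source code", "pull request")):
        wanted = {"list_github_repositories"}
    else:
        # Ambiguous turn: fall back to the full catalog and let the model
        # ask a clarifying question.
        return all_tools
    return [t for t in all_tools if t["function"]["name"] in wanted]
```

The fallback branch is the design tension in miniature: returning everything preserves functionality, but reintroduces exactly the ambiguity the loadout was meant to remove.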
In reaction to this pain, throughout development, we tried using various LLM observability tools to measure reliability and improve troubleshooting. Eventually we landed on Comet’s Opik after some trial and error.
Unfortunately, the learning curve on evals was tough for our team, which is more rooted in traditional forms of automated testing. Evals are not like those tests at all. Passing and failing is less clear-cut – an eval is more like an experiment you have to design, execute, and interpret before drawing your own conclusions about how the software performed.
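To illustrate the mindset shift, here’s a contrived example – not our actual Opik setup, and the agent object and dataset are made up. A traditional test asserts an exact outcome; an eval scores many samples and leaves the verdict to you:

```python
def exact_tool_match(expected_tool: str, chosen_tool: str) -> float:
    """Score 1.0 if the agent picked the expected tool, else 0.0."""
    return 1.0 if expected_tool == chosen_tool else 0.0

def run_eval(agent, dataset: list[dict]) -> float:
    """Run every prompt in the dataset and return the mean score.

    `agent` is a hypothetical object with a choose_tool(prompt) method;
    dataset items look like {"prompt": ..., "expected_tool": ...}.
    """
    scores = [
        exact_tool_match(item["expected_tool"], agent.choose_tool(item["prompt"]))
        for item in dataset
    ]
    return sum(scores) / len(scores)

# A unit test would assert run_eval(agent, dataset) == 1.0 and fail otherwise.
# An eval reports, say, 0.83 -- and the team has to decide whether that's
# good enough, a regression, or a reason to dig into the failing traces.
```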
We all agreed that tracing and evals were useful and important. They were worth investing in if we wanted to build a DevOps agent, where reliability and trust are table stakes. And now that we’ve had plenty of hands-on time with LLM observability, it feels like an indispensable component of future AI projects.
But with few resources and little experience, these technologies kicked our ass. Despite repeated effort and attempts, we weren’t able to utilize them to improve Limo before running out of money.
Lessons Learned
Serious AI-native development has a significant learning curve
Traditional automated tests are insufficient to ensure quality for AI applications
LLM observability is crucial for predictable development cycles
False Confidence
Last but not least, we got bitten by something I’m still bitter about: AI development tools. They really fucked us, and we paid a lot of money for them.
I’m not talking about GitHub Copilot or ChatGPT. Those were fine for what they were: chatbots that occasionally wrote useful code snippets or helped work through an idea. At $10-30 / month, they served their purpose well.
I’m mostly talking about Devin, the AI “junior engineer” we hired for $500 / month. You can ask anyone I met with back then – I was singing the praises of agentic coding, how every developer was doomed within a year. With a PR merge rate of ~75%, I was blitzing through features while still being able to handle clients. It seemed all it would take was a well-written Linear issue, and maybe a comment or two, for Devin to create working software.
It turns out, the progress was mostly a delusion. AI psychosis. Whatever you want to call it.
My weakness was blind faith. The code looked fine. Especially within the context of a single review. I barely noticed the mounting inconsistencies, hallucinations, and bugs that resulted from what I was approving and merging in small chunks. I was focused on tweaking the end result, not following the exhausting, boring details of implementation.
After a few months, we cancelled Devin in favor of Windsurf. By then, it was clear that trusting our “junior engineer” to go off and work unsupervised was hurting more than helping. Windsurf was quite a bit better: by watching carefully, we were all able to “steer” the conversation toward actually useful code changes and command executions. Almost like switching from Maximum Overdrive-style self-driving to adaptive cruise control.
But even then, it was often too easy – when stumped, impatient, or simply burned out – to phone things in. To vibe a prompt loosely resembling requirements into Windsurf’s agent mode and create a Pull Request. Reviewed by busy and unsuspecting teammates, AI slop often looked just fine, getting approved, merged, and deployed with subtle issues.
Another unsettling concern, both for Limo and AI usage in general: when problems were noticed by another developer, Pull Request authors often disowned their work, explaining how they had assumed Windsurf’s output to be correct without fully understanding why. With the end of our runway looming closer daily, how does one shrug off surprise and disappointment in a moment like that? Let alone discuss how to move forward, knowing the real work of critical thinking hasn’t started! Echoes of a recent MIT study stir uncomfortably in my head.
The end result was that we chased our tail a lot more than we had on any previous project. It was incredibly easy to get a false sense of confidence that the product was further along than it was. That the features would take less time to complete than they did. That the codebase was healthy and not full of hallucinated booby traps that would ensnare us later.
If we had relied less on AI tools, especially on the backend, I’m almost certain a version of Limo would be public and generally available today.
Lessons Learned
Trusting AI without verifying is a bad idea
AI tools provide negative value when used incorrectly
Human-in-the-loop is still critical for effective AI use
What’s Next
In the end, the Limo we built with the time and resources we had wasn’t compelling enough to raise money or land customers.
Now we are left to contemplate what to do with what’s left: the code itself, and the skills and experience we gained from trying to launch a product.
Short-term, as Liminal Labs, we’ll be posting more technical guides and content around developing reliable agents and AI apps, starting with evals. We hope to share some practical lessons for software engineers looking to take the plunge into their own AI engineering journey.
Personally, I’ve taken steps towards full-time employment in life sciences and shifted my interim consulting focus accordingly. I still believe in the long-term potential of generative AI, but I’m no longer willing to bet my entire livelihood on it.
To be honest, I’m relieved. I can go back to a relatively normal work-life balance. Aside from myself, Liminal Labs is taking on normal business again, with a team back from a fresh tour in the AI development warzone. Not everyone made it (no one is dead to be clear – just seeking employment). But the folks still around are eager to take on the next challenging and exciting project.
Let us know if you’ve got something in mind!
— Austin, Founder + CEO @ Liminal Labs