
Experiences with Fail Safer

During the past couple of weeks I have been slowly bringing www.failsafer.com to a temporary halt. The process is not over yet, and the halt isn’t meant to be permanent.
Fail Safer has been a great entrepreneurial and technological experience in my life, and it pains me not to have the time I wish I had to get past the beta phase and actually launch and market it.
Even though it was a beta release, Fail Safer was already fully functional: it was crawling the web for new websites and actively monitoring them with astonishing accuracy and performance on an incredibly small number of servers.
On a shoestring budget, and with me developing it on my own in my spare time, it managed to keep up with more than 39 thousand websites in different countries and languages. It monitored all of them every minute, made smart decisions on whether a site had gone offline or was having issues (returning any of the 500-range errors or behaving in unexpected ways), updated the site’s status page accordingly, and kept an accurate history of the ups and downs of all of those sites. All of that ran on economy hosting services in various locations around the world for monitoring, and cheap DigitalOcean boxes for the main servers.
It may sound easy, but it is nothing of the sort when you’re monitoring tens of thousands of websites every minute and need to make smart decisions about their status and accurately keep track of their health. The biggest part of the work was actually dedicated to overcoming this challenge in a smart and efficient way.
Timing has not been a good partner in this business, though. Right after I began rolling out the beta, I started conversations with my current partner and some investors about starting a studio company to create startups, whose first project was going to be one of the side projects I’ve been developing with my partner for a while now.
I left the service running and doing its job while I periodically monitored its performance and stats. It came to a point, though, where it became clear to me that it will take a while until I can come back to this project and give it what it takes to launch it properly and make it successful. That is why I decided to bring it to a halt for now.

Lessons learned

Even though Fail Safer has yet to receive any marketing or be properly introduced to the world, it was a real incorporated company from the beginning, with bank accounts, an EIN, a payment processing merchant account, and even a toll-free phone number! Going through the process of establishing a company here in the US for the first time (I had done so once before in Brazil) and getting the system ready for launch was a very enlightening experience.
Another critical lesson is that time is money, and vice-versa. When you have enough time, you can save a lot of money. When you don’t, you need money to compensate for the lack of time. Since time was the most accessible currency I had while I was developing Fail Safer, I took the longer route and used my spare time to move it forward the best I could. That took me longer than I wished, but it kept my endeavor within budget. Having a ton of money isn’t always the answer, though. Even if I had had an unlimited supply of cash, there is no assurance that all the money I would have poured in for a fast result would ever have been returned. An entrepreneur needs to find the balance between time and money. You don’t want to pour in a lot of money to test the water and compromise your ROI, but you also don’t want to be working on it for so long that by the time you get to test the water, the market is already different from when you started.
Get your ducks in a row before you start developing, and most importantly before you launch. Skipping this is like beginning construction of a building without blueprints or without properly understanding the owner’s needs. You’d think you’ll save time by skipping ahead and doing these things in parallel, but you’ll only suffer pain and setbacks. The same goes for when you’re trying to deliver the building. You’ll want to streamline work as much as possible and plan so that very few adjustments are needed, as opposed to leaving everything to be figured out at the end. Think of a bell curve, not a hockey stick. You don’t want to be solving a ton of problems right next to the launch date. What you really want is to be relieving development pressure around the launch so you can invest more of your time in the launch and marketing of the product. In my case, the actual launch and marketing part never came, but it paid off big time that everything was basically ready when I began incorporating and doing the final tweaks on the site. I left everything running and testing for days to catch bugs, fix them, and confirm that everything was working as it should.
[Figures: Bell Curve (range -3:3, μ 0, σ 2); Hockey Stick Graph]
Marketing is hard, and it requires a lot of effort and money to be done right and produce a ROI. Companies are out there paying top dollar for the prospect of landing a customer who will pay back the marketing investment over time. I knew that with Fail Safer it wouldn’t be different, and it really wasn’t. The site was never formally launched and no money or effort was invested in marketing. It has been online and monitoring tens of thousands of websites while constantly indexed by Google, and it never acquired customers.
We’ve all seen business ideas that started with little or no marketing and quickly went viral. Because we typically fall in love with our ideas, especially when they solve a pain point that is particularly interesting, we become certain that our idea will go viral too and that little or no marketing effort is needed to have people shell out their money for our product or service. The belief that our product will go viral is due to pure confirmation bias, and unfortunately, reality couldn’t be farther from that.
There are no cheap shortcuts in marketing, and it really pays off to properly understand your customers, create a marketing strategy and be ready for the launch. Remember the bell curve? The left tail is when you’ll start your strategy, but it is the right tail that will allow you to be ready for launch. If your development effort looks like a hockey stick, forget about it. Not only will the launch be messy, your marketing strategy will also be poorly executed.
Last, but not least, my greatest lesson from this venture was to always have a plan, knowing that the plan will change, and be ready to adapt as needed. A project with no planning is very bad. A plan strictly executed from start to end as initially envisioned is certainly better than no planning, but it is not a whole lot better. Some teams claim to be agile by adopting a “we’ll figure things out as we go” attitude in designing and developing software. Agility doesn’t come from operating without planning, but from being ready to alter the plans and adapt as required. Being agile without a plan is the same as trying to get from one location to another as fast as you can without using a map or properly planning your route, because you’re afraid that the route could have roadblocks that your plan couldn’t account for. The GPS has the right attitude when it comes to agile. It gives you a route from start to end and keeps adjusting the route whenever there is a change, never losing focus on the actual goal and trying to get you there as fast as possible: true agility comes from proper planning and being ready at any time to adapt the plan and keep your project on course.

Technology

At a glance, it may seem simple to slap some code together to monitor a handful of websites. And it is indeed pretty simple.
The problem really arises when you try to do that at scale. And it doesn’t take too many websites to reach the first roadblocks.
On average, a single request to check the health of a website takes about 1 second. Add to that the parsing and computation around understanding the status of the site, quality of the response, etc. From start to end you’re at about 1.5 seconds per healthy site.
If a given site is going through issues, the problem is much worse. The response time will be much longer (sometimes over 1 minute), and you may get timeouts, 404s, or pages for which the status code is 200 but whose content actually shows that the server is in distress. Add to that the fact that once a site is deemed to be in distress you have to get other servers in different regions to confirm the status, and you end up with a big problem to solve.
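To make the cases above concrete, here is a minimal Python sketch of how a single check result could be classified. It is not Fail Safer’s actual code: the function name, the `SLOW_SECONDS` threshold, and the `DISTRESS_MARKERS` strings are all hypothetical, chosen only for illustration.

```python
from typing import Optional

# Hypothetical values for illustration; a real monitor would tune these.
SLOW_SECONDS = 10.0
DISTRESS_MARKERS = ("service unavailable", "database error", "too many connections")


def classify_response(status_code: Optional[int],
                      elapsed_seconds: float,
                      body: str = "") -> str:
    """Classify one check as 'healthy', 'distressed' or 'down'.

    status_code is None when the request timed out or could not connect.
    """
    if status_code is None:
        return "down"
    if status_code >= 500 or status_code == 404:
        return "distressed"
    # A 200 page can still be an error page in disguise.
    text = body.lower()
    if any(marker in text for marker in DISTRESS_MARKERS):
        return "distressed"
    if elapsed_seconds > SLOW_SECONDS:
        return "distressed"
    return "healthy"
```

A result other than "healthy" would then be the trigger for asking servers in other regions to confirm the status.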
If we stick with the healthy websites, monitoring 10 thousand of them every minute at about 1.5 seconds each adds up to 250 seconds of checking work for every second on the clock, so roughly 250 checks have to be running at any given moment! Right before I brought it to a halt, Fail Safer was monitoring 39 thousand websites, which by the same math means about 975 full checks in flight at all times. Not only is this a very large number of simultaneous requests, it also demands a lot of bandwidth and causes an insane amount of writes and reads on the database.
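The 250 and 975 figures can be reproduced from the 1.5 seconds per healthy check mentioned earlier. The short Python sketch below does the arithmetic; the function names are mine, and I am assuming that this is how the numbers were derived. Strictly speaking, the result is the number of checks that must be running at once, while the number of checks started per second is lower.

```python
import math


def checks_in_flight(sites: int, seconds_per_check: float,
                     interval_seconds: float = 60.0) -> int:
    """Average number of checks that must be running at the same time
    to cover `sites` websites once every `interval_seconds`."""
    return math.ceil(sites * seconds_per_check / interval_seconds)


def checks_started_per_second(sites: int,
                              interval_seconds: float = 60.0) -> float:
    """How many new checks have to start each second."""
    return sites / interval_seconds


print(checks_in_flight(10_000, 1.5))      # 250
print(checks_in_flight(39_000, 1.5))      # 975
print(checks_started_per_second(39_000))  # 650.0
```

Sites in distress push these numbers up quickly, since a check that takes a minute instead of a second occupies a slot for the whole interval.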
If the database models were not well designed, not only would the monitors fail to store the status verified for any given website, but the status page would also be mostly unusable due to long lock waits and potential timeouts caused by the frequent reads and writes around these records.
I took great pleasure in solving these challenges and managed to achieve very high performance with very few economy servers. The result was status pages that would always load instantaneously regardless of the load on the database. I used a mix of MySQL and Redis, and created my own mapper for Redis that can relate to MySQL entries easily and vice-versa. I used keys that would let me shard and scale the servers easily, and made extensive use of aggregations and other techniques to make read operations as instantaneous as possible. The writing tasks were also fast: all the computation was done in memory, and at the time of write it would take only two or three operations.
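As a rough illustration of the two ideas in this paragraph, shard-friendly keys and computing in memory before writing, here is a small self-contained Python sketch. It does not talk to Redis or MySQL and it is not the mapper described above; the key format, `shard_for`, and `CheckAggregator` are hypothetical names made up for this example.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def status_key(site_id: int) -> str:
    """Hypothetical key layout: one predictable key per site."""
    return f"site:{site_id}:status"


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard from a stable hash of the key, so the same key
    always lands on the same server."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


class CheckAggregator:
    """Accumulates check results in memory and emits one summary
    write per site when flushed."""

    def __init__(self) -> None:
        self._counts: Dict[int, Dict[str, int]] = defaultdict(
            lambda: {"up": 0, "down": 0})

    def record(self, site_id: int, is_up: bool) -> None:
        self._counts[site_id]["up" if is_up else "down"] += 1

    def flush(self) -> List[Tuple[str, Dict[str, int]]]:
        """Return (key, summary) pairs to write, then reset."""
        writes = [(status_key(site_id), dict(counts))
                  for site_id, counts in self._counts.items()]
        self._counts.clear()
        return writes
```

With this shape, many check results collapse into a single write per site, and the status page can read one precomputed summary instead of scanning raw check rows.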
Other challenges I had to overcome included detecting whether a given site was unreachable only from a specific region before considering it unhealthy. That involved prioritizing checks of the same website from servers in different geographical locations. If some servers consistently failed to access websites while servers in other regions could see them, the checking load was rearranged to take the bad servers out of the way and avoid skewing results on the status page.
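The logic described here can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the real implementation: the function names, the status labels, and the `min_disagreements` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def site_status(results: Dict[str, bool]) -> str:
    """Combine per-region results (region -> check succeeded)."""
    if not results:
        return "unknown"
    successes = sum(results.values())
    if successes == len(results):
        return "up"
    if successes == 0:
        return "down"
    return "regional"  # reachable from some regions only


def suspect_monitors(history: List[Dict[str, bool]],
                     min_disagreements: int = 3) -> Set[str]:
    """Regions that failed a check while at least one other region
    succeeded, at least `min_disagreements` times."""
    disagreements: Counter = Counter()
    for results in history:
        if any(results.values()):
            for region, ok in results.items():
                if not ok:
                    disagreements[region] += 1
    return {region for region, count in disagreements.items()
            if count >= min_disagreements}
```

A monitor flagged by `suspect_monitors` would be a candidate for having its share of the checking load moved to other servers, so that its failures stop showing up on status pages.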
All of these challenges taught me a lot about scaling and about designing robust code, tasks, and databases. Despite the bitterness of having to defer this launch to a later date, I already feel successful for having designed a very efficient system that scaled so well and delivered results astonishingly fast.

Startup Foundry

Every entrepreneur is plagued by the curse of having new ideas on how to solve problems. I say plagued and cursed because it is torture to have so many good ideas and yet be tied to investing your time in only one at a time. Startup Foundry is the dream company of any entrepreneur, really. Think of a business with the sole purpose of developing new ideas every few months. Yes, I love the idea too. And I’ve been fortunate enough to be part of this venture. But in order for this to work, I have to give it my undivided attention for now, and that is why I’ve decided to bring Fail Safer to a halt instead of charging forward with a formal launch and proper marketing. Stay tuned though: like I said before, I intend to bring it back in the future and give this idea a real chance with an adequate launch and marketing.