Happy to contribute to this excellent series from NPI. You won’t find a better overview of cloud service procurement anywhere, with detailed comparisons of services from Amazon, Microsoft, Google, IBM and Oracle: Read on!
Treat cloud service measurement and evaluation as a center of excellence. If you don’t, it could cost you a lot of money down the road. Read more from InfoWorld.
I remember the first time I encountered the term “Software as a Service,” and it was longer ago than you might think. I was running a red-hot ASP start-up in Houston called ebaseOne. We’d taken a tiny local software VAR, changed the name, secured investment capital, hired some infrastructure experts, built a state of the art operations center, signed a co-location deal for a brand new leading-edge data center and leveraged the existing staff’s application expertise to start offering application access to customers for a single, monthly per-user subscription fee with almost no cost up front. We didn’t develop software – we just hosted a variety of products, on the assumption that customers wouldn’t want to deal with hundreds of vendors for their software subscription needs – they’d want a one-stop shop that they could get all their applications from. For products that weren’t web-based, we used technologies from Citrix and Marimba to distribute updates and manage the customer desktop. Cisco, among other partners, thought we were visionaries. A year after we started we’d gone through a few million bucks in funding, but our market capitalization was over a half billion. Then the internet bubble burst at exactly the wrong moment, and the next round of funding never came. Ouch.
We weren’t the only ones. Another ASP getting attention at that time was US Internetworking, or USi for short. If memory serves, at some point while we were ramping up, USi came up with a brilliant tag-line: Software as a Service. It wasn’t a trademark or anything – just a term they started using when speaking to investors. We started using it too. So did another little company that called itself an ASP at the time but developed its own software: Salesforce.Com. This was in 1999, long before cloud computing (it wasn’t until the end of 2004 that AWS was even available for public use, and it didn’t really catch fire until much later in the decade).
When cloud did first begin getting everyone’s attention there was a lot of head-scratching about the actual definition of it. Many people just thought it was a way to rent virtualized servers. The consulting firm I went to after my start-up looked to me for a definition in 2010. Among other things, I told them this about SaaS: “SaaS can be delivered via cloud computing, but it does not have to be.” And that was true. All you needed to do for the SaaS label to apply to your service was to host applications and charge for access per user per month instead of selling licenses. To a great extent, and to my complete surprise, that is still true today. Now I’m going to talk about why.
Shortly after that, much of what I had said about SaaS became irrelevant. NIST had published their official definition, and the industry adopted it pervasively. In that definition, SaaS was defined as one of the three cloud service models (IaaS, PaaS and SaaS). The uncomfortable fact that SaaS had already existed for a long time and that many offerings were not very cloud-like was simply ignored. SaaS was now cloud, by definition, and all SaaS vendors instantly became Cloud Service Providers, without so much as a “hey, wait a minute.” Customers and pundits could have asked if SaaS services even sat on top of elastic cloud infrastructure, but they didn’t. A lot of SaaS applications still don’t. I believe it is this de facto SaaS = cloud equivalence that has distracted customers from what could really be achieved if the cloud model were fully applied to application services.
A big thing that SaaS products don’t offer today, but could, is real usage-based pricing. Every SaaS provider will tell you that “you pay only for what you use,” because you only pay for the users that are allowed to use software. Time to take the red pill, Neo. That’s really all you were paying for before cloud came along, and it’s called user-based pricing. What happens when a user is out sick and not using the software? Oh yeah, you still pay. SaaS providers offer the same thing today that software companies have been offering as long as I can remember: an up-front fee followed by a periodic fee for updates and support. Sure, your IT folks don’t have to do the upgrades – yay – but financially the only difference is that the bill has changed to monthly instead of annual, if you’re lucky. Many SaaS customers find that signing a deal will still lock them into a multi-year commitment for a certain minimum number of users. That’s nothing new, and it’s in complete opposition to one of the basic philosophies of cloud, which is to pay only for use.
Now, if SaaS services were to measure minutes (or seconds) of the time someone actually has the app active on their desktop and billed you based on that, we would really have something different. The number of users wouldn’t be directly relevant, and you really would be paying only for what you used. If your users started using the software less and less, your bill would automatically decrease in step with that. No more need for closely managing the software portfolio to make sure you aren’t over-licensed. Wouldn’t that be great? I think it would. Over-licensing due to shelf-ware and functional duplication between products can represent millions of dollars in annual software costs for a large enterprise. Start-ups could start out with enterprise level apps and not worry about their solutions hitting a scalability wall. Enterprises heading into a downturn wouldn’t be stuck paying for software they don’t need since charges would go down in lock-step with usage. This is what I call “pay as you grow / save as you shrink.” Want to migrate away from a product and replace it with a different one? Just stop using it whenever you’re ready.
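To make the contrast concrete, here is a minimal sketch comparing the two billing models. The fee, the per-minute rate and the function names are all hypothetical, chosen only for illustration; no real provider's pricing is implied.

```python
def per_user_bill(licensed_users: int, fee_per_user: float) -> float:
    """User-based pricing: you pay for every licensed seat, used or not."""
    return licensed_users * fee_per_user


def metered_bill(active_minutes_by_user: dict, rate_per_minute: float) -> float:
    """Usage-based pricing: you pay only for minutes the app was actually active."""
    return sum(active_minutes_by_user.values()) * rate_per_minute


if __name__ == "__main__":
    # Hypothetical month: three licensed users, one of them out sick all month.
    minutes = {"user_a": 1000, "user_b": 1000, "user_c": 0}
    print("per-user:", per_user_bill(licensed_users=3, fee_per_user=50.0))
    print("metered: ", metered_bill(minutes, rate_per_minute=0.02))
```

Under the per-user model the absent user still costs a full seat. Under the metered model that user contributes nothing to the bill, and the total moves up or down with actual time in the application.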
And the data from metering could eventually be used in all kinds of constructive ways. You could see how a change in features, functions, processes, staff, work environment, etc. affects time spent in the application, for example. You could see changes in usage patterns in near real-time and investigate root causes. Of course, you’d need to keep pressure on your provider to make sure that changes to the application didn’t lower productivity, thus increasing their revenue, but that’s all part of the evaluation.
We didn’t implement true usage-based pricing at ebaseOne, but we were working on it. We invested in the high-end monitoring and billing systems we would need to bill our customers for minutes of usage, got them up and running and did the training. This is similar to what long-distance phone companies once did in order to bill for phone calls. Many SaaS providers today don’t have all the required systems in place yet, but only because they don’t need to. They don’t relish the idea of an unpredictable revenue stream, where there are no financial barriers keeping their customers from switching to the competition. If they’re public they have to set expectations for the street every quarter and then meet those expectations, so predictability is good. And yet, Amazon is in exactly the same situation with the on-demand services in AWS, and they’re doing just fine.
The real reason that SaaS providers haven’t offered this is that the market has not demanded it. Yet. Software companies don’t like change and will have to be forced into it by competition. Start-ups – this is your opportunity.
Looking back in time, I mentally divide articles on cloud security into two general “waves.” The first wave, from the early days of cloud, consists of articles that looked at the shared, public nature of what Amazon was selling and said “the cloud is not yet secure!” In addition to the very idea of shared infrastructure going against everything the security industry was trying to achieve, there were also some quite valid concerns that potential customers needed to know about. Server-based network switching could theoretically be exploited to gain access to VMs that weren’t assigned to you, for example. Data stored in the cloud was not always securely wiped before the storage space was assigned to new customers, and that made headlines when customers found they could recover information from previous users of the space. CSOs were legitimately concerned about jumping into cloud services too quickly, often citing the need to comply with government or industry security standards like PCI DSS, HIPAA and FedRAMP. The takeaway? Cloud was less secure than your own data center.
The second wave of cloud security articles came after the leading cloud providers began publishing long lists of standards that their customers had, in fact, complied with, along with a commitment to support that compliance going forward. The early technical and procedural difficulties were now in the past and relegated to a bin labeled “growing pains.” At that point, the narrative morphed into something new. “All one has to do is look at what the public cloud behemoths spend on security to know that they can achieve far more than your own, sadly underfunded security operation could ever hope to do,” everyone seemed to say. That narrative is still with us today, showing up in security articles and as casual mentions in more general guidance on cloud services. The takeaway? Cloud is more secure than your own data center.
Both of these narratives are false.
They are false because they are dramatic oversimplifications. Worse, they are actually damaging the quality of critical business decisions being made right now. Here’s what’s wrong with them:
- They don’t specify what is meant by “cloud.” Amazon, Microsoft, Google, IBM and all the rest do not secure their cloud infrastructure in exactly the same way. Securing technology assets is way too complex, and security designs are way too proprietary, for that to happen. Data centers are not located in the same places, which means vulnerabilities to disasters or physical incursions vary, as do the political, economic and legal risks associated with the various jurisdictions that cover them. Is that public cloud or private cloud? Because things like auditability and penetration testing are often far more limited when you’re using infrastructure that might be shared with other customers. On-site or off-site? Because if the assets are off-site you’ve almost certainly increased your attack surface by creating an interface with an external entity that has to be trusted, and that entity also has interfaces between your service infrastructure and its corporate network. SaaS providers often use 3rd parties to host their infrastructure, which means you may have added multiple interfaces to your attack surface. Major cloud providers do have world-class security implementations. They have the money to invest, and they know the risk to their business represented by the threat of a well-publicized breach. They’d better, because they’re also more enticing targets than you would be on your own. Data thieves would love to find a way to get into all the data stored on a large provider’s infrastructure. When thinking about the potential for a criminal act, you should never fail to consider motive.
- They don’t specify what’s meant by “your own data center.” Who are you? If you’re a small business that has your data center in someone’s converted office and an IT staff that you can count on your fingers, then yeah, large cloud providers probably have better security than you do… as far as it goes. Just remember that OS patches and any interfaces outside the providers’ facilities are usually your responsibility, like the management console that you use to manage cloud resources. What happens when the administrators that can access that console leave the company… in a bad mood… because they quit or were fired? Can all of your backups be deleted from that console? I hope not – don’t forget what happened to Code Spaces. Now, if you’re a large multinational with an IT department as big as a medium-sized company that prioritizes security (I’m looking at you, banks and defense contractors), the answer could be different. You need to look very hard at the provider’s security to determine if it’s a step up, a step down or a lateral move with balancing pros and cons. Do you have unique security needs that can’t be met by a service that’s designed for a horizontal market? You might. Do you have vulnerabilities in your own security architecture that haven’t been closed as the exploits continually multiply? You might. The comparison you’ll need to make will depend in part on who the provider is and on which service you’re considering.
- Just asking the question can be misleading. It gives customers the impression that when they move to the cloud they are replacing all of their own security with the provider’s security, and for basic IaaS that just isn’t the case. It’s true that the provider has taken on physical data center security, network boundary protection and monitoring, the hypervisor and other security concerns specific to the provider’s infrastructure. They’ve also put in place all of the encryption and configurations necessary for a cloud service to actually operate. But you are still doing your own patching. You are still managing the guest OS and all the utilities and applications. You are still doing your own firewall configuration too, even though that firewall belongs to the provider. Many IaaS customers are still surprised when they find out just how little of their security responsibility has actually been taken on by the provider. Now, if it’s SaaS or a “managed” service like a database, the picture changes. The provider is likely to take responsibility for much more infrastructure security, and some of your attention can shift upwards to identity management and authentication. It’s critical to pay attention to what the provider is actually taking responsibility for.
We’ve only scratched the surface of a very complex topic here, but the next time you see an article that says “cloud security is worse” or “cloud security is better,” I hope that you will give it all the attention it deserves, which is none. The whole reason that The Cloud Service Evaluation Handbook has an entire chapter on security is that the security that is best for you depends on your needs, your capabilities and on the characteristics of each particular service. You’ll need to do an actual evaluation, with participation from your security team, to find out what you’re gaining or losing when you move to cloud.
In case you missed the first two laws of cloud cost optimization, here they are:
- Know Your Application, and
- Do a Thorough Service Evaluation
But let’s say you’re following those already. You’ve got a good grasp of the application’s workload profile, its needs for disaster recovery, security, response time, elasticity, interoperability and more (see The Cloud Service Evaluation Handbook™), or at least you have people on the team that do, and all of those people contributed to the evaluation of the service that you finally chose. You’ve definitely gotten off to the right start, but costs can still spiral out of control even if they were optimal on day one. The key to making sure that doesn’t happen is taking a variety of steps that fall under the heading of good Demand Management. Oh sure, you might think you’ve got that covered if you’ve been doing IT Demand Management for your private, non-cloud assets. Let’s say you already require strong business cases for new acquisitions, you monitor and improve the accuracy of those business cases, you make sure that old equipment and software licenses get reused to meet new requirements or get properly decommissioned, making sure they don’t continue to generate unnecessary costs in the form of management fees, software maintenance, etc. If you’re doing those things, pat yourself on the back because many organizations weren’t doing them very well even before cloud came along.
Now for the bad news: there’s a lot more to good demand management with cloud, particularly public cloud. First of all, it’s more important now than it ever was. An unused server sitting around in your own data center may not have been a big deal if you didn’t need the space for anything else. In the cloud, however, you’re paying for that unused resource every month, just throwing money straight out the window. The “usage” that cloud providers measure when they create your invoice is determined by the number and type of resources that you, the customer, have allocated to yourself. It’s not what you are actually using to do useful work at the moment, and financial people are sometimes surprised by that. Since the providers make more money when you over-allocate or forget to turn something off, you can’t expect them to do much to help you. Even when they provide tools that let you see where your issues are, it’s still up to you to use them diligently to keep your charges down. Here are a few leading practices that you can use to optimize costs. If you aren’t already using them, you could easily be overpaying by 20% to 50% or even more.
- Reassess your service decisions continually. Application workload characteristics change, and the service you chose initially was greatly influenced by what you expected those characteristics to be. You may have selected on-demand instances thinking that your application could release them most of the time. Is that what you’re seeing? Does your application even have the ability to release resources automatically, or would you benefit from a simple tool like ParkMyCloud that lets you deactivate them on a fixed schedule? Or you may have selected dedicated or reserved instances thinking that workloads would be fairly steady around the clock, and that may be true for some of them, but now that the workload has grown, perhaps some of the resources aren’t needed during off-peak hours. Would going to an on-demand scheme for some of them save money? If usage is very light and it’s the right kind of application, perhaps a FaaS solution like AWS Lambda, Azure Functions or Google Cloud Functions would now be a better choice, or even a Google “preemptible” VM.
- Reassess your configuration choices continually. For the same reasons that your service choice may no longer be optimal, your configurations are even more likely to need re-optimization as time goes on. Memory requirements, in particular, can change quickly. Performance bottlenecks will force you to go to instances with more memory if you need to, but there can also be opportunities to go to smaller memory configurations to save money. Your developers can and should be looking for ways to make your applications more efficient, and money saved on your cloud bill is the measurable payout for that, not only from charges for VMs and storage but for data transfer as well.
- Look for “zombie” resources continually. Zombies are resources that aren’t being used because they’ve become “unattached” or simply aren’t needed for running your applications. Storage volumes not attached to a compute instance are common examples, and you may be paying for the entire capacity even though you’re storing nothing on it. For IaaS, IP addresses, databases, load balancers and even VMs can all be zombies. One of the great things about cloud is the ability to try out ideas and then abandon them without making a large investment in infrastructure first, but zombies are a natural consequence of that. Also, if you aren’t currently “tagging” your resources, start now. You’ll want more tags for reporting purposes, but an Owner tag is critical for determining which resources are really zombies before you deprovision them. User accounts can also become zombies when employees leave or change jobs or when trading partners change, so this is important for SaaS as well as IaaS. Inactive accounts represent security risks as well as unneeded costs. Use your tools to detect resources with low utilization and investigate them.
- Keep doing Hierarchical Storage Management and Information Lifecycle Management, continually. They are still relevant in the cloud, perhaps even more so when the equipment is off-site than when you owned it. Cloud automation makes using lots of storage very easy, and those monthly charges are just going to keep coming until you do something to turn them off. Old snapshots may be okay to delete when you have enough newer ones. Back in the days of mainframes we used to periodically look at the last access date on every file and archive anything that went too long without being touched. In the cloud, you have numerous options on where to put your data, and some are orders of magnitude less expensive than others. Migrating data to options with lower redundancy, lower IOPS, magnetic vs. solid state, “infrequent usage” tiers, long-retrieval-time services (e.g. AWS Glacier), and, of course, deletion are all potential ways to save money.
- Look at opportunities to refresh your infrastructure. You may be thinking “Wait a minute! I thought cloud got rid of the need for us to do that!” Well, sort of. You may not have to rack and stack new boxes anymore, but you also don’t want to keep paying for your provider’s fully depreciated old equipment unless it has some unique feature that you still need. There’s a common misconception out there that the rapid price drops in the public cloud market are automatically passed on to customers, and, as we’ve discussed previously, that often isn’t true. First of all, they aren’t coming as quickly as they used to, and when they do they may only be implemented on new resource types, so if you don’t refresh the resources you’re using with the newer technology, you get zero benefit. How price changes take effect is entirely up to the provider, however, so it’s up to you to analyze pricing on every new configuration they release to see how it’s relevant to you.
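The zombie hunt described in the list above can be sketched as a simple filter over a resource inventory. This is only an illustration: the `Resource` record, its fields, the 5% utilization floor and the sample IDs are all hypothetical, and in practice the inventory would come from your provider’s reporting tools or a cost optimization product rather than a hand-built list.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    resource_id: str
    kind: str               # e.g. "volume", "vm", "ip", "load_balancer"
    attached: bool          # is it attached to, or serving, anything?
    avg_utilization: float  # 0.0 to 1.0 over the review window
    tags: dict = field(default_factory=dict)


def find_zombie_candidates(resources, utilization_floor=0.05):
    """Return (id, owner, reasons) for resources that deserve investigation.

    These are candidates only: the Owner tag tells you whom to ask
    before anything is deprovisioned.
    """
    candidates = []
    for r in resources:
        reasons = []
        if not r.attached:
            reasons.append("unattached")
        if r.avg_utilization < utilization_floor:
            reasons.append("low utilization")
        if reasons:
            owner = r.tags.get("Owner", "UNKNOWN: no Owner tag")
            candidates.append((r.resource_id, owner, reasons))
    return candidates


if __name__ == "__main__":
    inventory = [
        Resource("vol-1", "volume", attached=False, avg_utilization=0.0,
                 tags={"Owner": "alice"}),
        Resource("vm-1", "vm", attached=True, avg_utilization=0.60,
                 tags={"Owner": "bob"}),
        Resource("vm-2", "vm", attached=True, avg_utilization=0.01),
    ]
    for candidate in find_zombie_candidates(inventory):
        print(candidate)
```

Note that an untagged resource is reported with an unknown owner rather than silently dropped, since those are exactly the ones that are hardest to clean up safely.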
Now, the good news is that there are tools to help with all this. Cost optimization tools are often incorporated in Cloud Management Platforms and help with all the things I’ve mentioned above. If you have a large cloud investment I’d expect the business case for one of them to be compelling (e.g., CloudHealth, Cloud Cruiser, Cloudyn, Cloudability, Cloudamize, RightScale and more). CloudHealth, in particular, has been good about sharing their wisdom on this topic.
So, what’s the Third Law of Cloud Cost Optimization?
“Clean Up Your Mess, …Continually!”
Last time we covered some of the basics of AWS Lambda, so you should know what the service is, what it promises to do, and a little about how well it meets those promises. If not, check out the last post. Now let’s talk about pricing!
Lambda pricing is based on two fairly simple charges: $.20 per million executions of your function plus $0.00001667 per GB-second of processing. That means that if you specify 2 GB of RAM, the second part of the price will be twice as high as it would be if you specified only 1 GB of RAM. The good news is that the first 1 million executions of your function and 400,000 GB-sec of processing time per month are free. This is great for development, since before you’ve put your functions in production you generally need to do fewer executions than you do after the application goes live, and those test executions won’t cost you anything. This doesn’t include data transfer and storage charges that your application might incur when it uses other Amazon services, however.
So, let’s take a look at what Lambda will cost you if your executions are 50 ms, 100 ms or 200 ms in length. This isn’t completely realistic since your function probably won’t take the same time to execute every time, but it should help you to visualize approximately how your charges will behave. By the way, Microsoft’s pricing for Azure Functions is virtually identical to Lambda’s and has the same 100 ms minimum charge, so this analysis applies there as well. As I write this, Google’s Cloud Functions service is still in alpha test, and I haven’t seen any published pricing yet.
Note that the red line for 50 ms functions is completely hidden underneath the purple line for 100 ms functions due to the 100 ms minimum charge. 200 ms functions cost twice as much as 100 ms, as would anything between 100 ms and 200 ms. If I added 150 ms functions to this chart, the line would be hidden under the orange line for 200 ms. That’s because execution times are always rounded up to the nearest 100 ms for purposes of billing. That’s very important to understand if you want to avoid surprises on your bill, and, personally, I’m not a big fan of it. Come on, Amazon – just offer simple per-ms pricing with no minimum duration, and charges will be a lot more predictable.
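The pricing rules described so far can be sketched in a few lines. This is a minimal sketch built only from the figures quoted in this post (the two charges, the monthly free allowance and the 100 ms rounding). The function names are my own, and it ignores data transfer, storage and any other services your function touches.

```python
import math

PRICE_PER_MILLION_REQUESTS = 0.20  # USD per million executions
PRICE_PER_GB_SECOND = 0.00001667   # USD per GB-second of processing
FREE_REQUESTS = 1_000_000          # free executions per month
FREE_GB_SECONDS = 400_000          # free GB-seconds per month
BILLING_INCREMENT_MS = 100         # durations are rounded up to this


def billed_ms(duration_ms: float) -> int:
    """Round a duration up to the next 100 ms (minimum one increment)."""
    increments = max(1, math.ceil(duration_ms / BILLING_INCREMENT_MS))
    return increments * BILLING_INCREMENT_MS


def monthly_lambda_cost(executions: int, duration_ms: float,
                        memory_gb: float) -> float:
    """Estimated monthly charge in USD after the free allowance."""
    gb_seconds = executions * (billed_ms(duration_ms) / 1000) * memory_gb
    request_charge = (max(0, executions - FREE_REQUESTS) / 1_000_000
                      * PRICE_PER_MILLION_REQUESTS)
    compute_charge = max(0.0, gb_seconds - FREE_GB_SECONDS) * PRICE_PER_GB_SECOND
    return request_charge + compute_charge


if __name__ == "__main__":
    # Same execution count and memory, three different durations.
    for ms in (50, 100, 150, 200):
        print(ms, "ms ->", round(monthly_lambda_cost(5_000_000, ms, 1.0), 2))
```

Running it for 50 ms and 100 ms gives identical totals, as does 150 ms versus 200 ms, which is the rounding effect described above. The request charge and the free allowance do not scale with duration, so the 200 ms total is not exactly double the 100 ms total at every volume.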
Now, what about EC2? Since Amazon assigns resources to your function in the same proportion as a general-purpose compute instance, let’s use one of those as a hypothetical comparison. An on-demand m4.large with 8 GB RAM currently costs $.108 per hour, which is $.0000037 per GB-second. If you reserve the instance for 3 years and pay up-front, it only costs $.04 per hour, or $.0000013 per GB-second. Let’s just assume that your functions execute sequentially as quickly as possible for an entire month (not real-world I know, but stay with me – this is just a hypothetical), and that once you’ve filled up a month of execution time you would need an additional instance to process any more. That tells us how many executions you can process per month on an EC2 instance and gives us a comparison that looks like this:
At very high loads on your EC2 instances, which corresponds to very high numbers of executions, Lambda is 4.5 times the cost of the on-demand instance and 11.4 times the cost of the reserved instance. So it’s a terrible deal, right? Well no, not really, and here’s why:
- On-demand instances won’t work. Lambda is designed for functions that must execute rapidly after being triggered by events. To get the benefit of on-demand EC2 pricing you would have to spin up that instance each time the function was called, and that’s the assumption that underlies the orange line in the chart. Unfortunately, that would add a tremendous amount of processing overhead, and your function would be dog slow. To use EC2 at all you would need your instance to be already available and waiting for a triggering event, and that means 100% instance usage per month. Reserved instances are always cheaper than on-demand at 100% usage.
- Your functions are very likely not going to run 100% of the time, and that’s the assumption that appears to be built into Lambda’s pricing. In this hypothetical scenario, Lambda’s sweet spot is on the left side of the chart, below 3 million function executions per month. Your mileage will vary a bit because your functions won’t take exactly this much time to execute (the further they are from a multiple of 100 ms, the less attractive Lambda will be), and you’ll want to be operating in the range where the free executions that come with Lambda still have a noticeable impact on your bill. That’s also where a reserved instance would still be very underutilized. If your traffic starts pushing you to the right side of the chart, consider using reserved instances to save money. Just make sure you take the next point into account as well.
- Lambda and EC2 can’t be directly compared without any adjustments, which is why I added the note at the bottom of the chart. We have an apples and oranges issue. Lambda is far closer to being a fully managed service than EC2 is. It includes things like patching, disaster recovery, backup and pretty much all standard IT management tasks that go with having a server other than security, as I mentioned in my last post. I estimate that to be worth about twice what you get from an EC2 instance in terms of reducing your internal IT costs. So, in our example, instead of switching to a reserved instance at just under 3 million executions per month, you should probably wait until about 6 million if you care about optimizing TCO for your infrastructure long-term. And that doesn’t include savings in the development cycle. Lambda is designed to trigger your functions in response to events that occur in social media streams, stored data, monitored states and other event sources. That should mean less coding for you. It’s also designed to auto-scale by default, without making you build that intelligence into your application. I’m not sure how much time this saves, but I’d love to hear from any developers out there with experience that can provide some input.
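For readers who want to rerun the comparison with their own numbers, here is a rough sketch of the arithmetic. The hourly prices and Lambda rates are the ones quoted in this post; everything else is an assumption made for illustration: a 730-hour month, 100 ms of billed time per execution, and a function allotted the same 8 GB as the instance. The chart’s exact inputs may have differed, and the sketch makes no adjustment for the management work that Lambda takes off your hands.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def ec2_price_per_gb_second(hourly_price: float, ram_gb: float) -> float:
    """Convert an hourly instance price to a per-GB-second price."""
    return hourly_price / 3600 / ram_gb


def lambda_monthly_cost(executions: int, billed_seconds: float,
                        memory_gb: float) -> float:
    """Lambda charge in USD after the monthly free allowance."""
    requests = max(0, executions - 1_000_000) * 0.20 / 1_000_000
    gb_seconds = executions * billed_seconds * memory_gb
    compute = max(0.0, gb_seconds - 400_000) * 0.00001667
    return requests + compute


def break_even_executions(instance_monthly_cost: float, billed_seconds: float,
                          memory_gb: float, step: int = 10_000) -> int:
    """First execution count (in multiples of step) where Lambda costs more."""
    n = 0
    while lambda_monthly_cost(n, billed_seconds, memory_gb) <= instance_monthly_cost:
        n += step
    return n


if __name__ == "__main__":
    on_demand = ec2_price_per_gb_second(0.108, 8)
    reserved = ec2_price_per_gb_second(0.04, 8)
    print("on-demand per GB-second:", on_demand)
    print("reserved per GB-second: ", reserved)
    print("Lambda vs on-demand rate:", 0.00001667 / on_demand)

    reserved_monthly = 0.04 * HOURS_PER_MONTH
    print("break-even vs reserved:",
          break_even_executions(reserved_monthly, 0.1, 8))
```

Under those assumptions the break-even against the reserved instance falls between 2.6 and 2.7 million executions per month, in line with the “just under 3 million” figure above. Change the memory size, the duration or the instance price and the crossover moves accordingly, which is why it is worth testing with measurements from your own functions.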
Hopefully that will help with decision making around when to use Lambda and other FaaS offerings, and when not to. Use them for event-triggered processes that need fast response times but that aren’t going to run in such high volumes that they would justify reserving an entire instance. Test your functions so that you can generate appropriate charts like the ones above and predict your costs. If you try out FaaS knowing what to expect, on applications for which it is well suited, I believe you will be very pleased with the results.