How to burn the most money with a single click in Azure | Hacker News

Original source (news.ycombinator.com)
Tags: cloud cloud-computing aws azure opm news.ycombinator.com
Clipped on: 2020-04-06

A few years ago my startup was killed by an AWS mistake that ran overnight. The irony: my AWS expert at the time had made exactly the same provisioning mistake at his previous job - so I figured he'd never make an $80k mistake again. It turns out his mistake at my startup was even more impressive. On the positive side, he did help shell out with me to cover the cost - but overnight we were out of money. The mistake shocked me, and I've since heard so many stories of similar mistakes. The event hit me so hard I went back in time to PHP and shared hosting. Not kidding.

What was the mistake?

Running up a shitload of instances for testing and leaving all of them running overnight. Each of these instances continually rendered 4K video data to storage. This kind of test was supposed to be 1000x smaller, running for at most 10-20 seconds at a time. He had written his own provisioning system which - according to his report - failed to properly manage instances in a "weird" edge case. No kidding.

Every morning I would check AWS billing just out of habit. I'm just thankful I did - otherwise everything would have kept running...

The lesson for me was don't trust your internally-hacked-together instance management system. The AWS interface to storage and instances is the base truth. And perhaps more importantly - I'm never getting into another startup which has financial risk like that without being a core expert in that risk/tech. I was focused on the business + client code - and had very little clue about the nitty-gritty of AWS. I should have been more involved with the code on that side, or at least the data-flow architecture.


SRE here. I feel for your situation. Here's some advice. One simple thing you could do is set up AWS billing alarms and have them delivered to a notification app like PagerDuty.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori....

If you don't want to pay for PD, you can patch together any number of ways to get your phone to scream and holler when it gets an email from ohshit@amazonaws.com. It's also good to have clear expectations as to whose responsibility it is to deal with problem x between the hours of y and z, and exactly what they are supposed to do.

Keep the alerts restricted to the really important stuff, because if your team becomes overloaded with useless alerts they will 1) dislike you and 2) be more prone to mistaking a five-alarm fire for a burnt casserole.

There are more complex systems you could build, but that's a start.
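
For a concrete starting point, here's a minimal sketch (Python/boto3; the alarm name and SNS topic ARN are hypothetical placeholders) of the kind of CloudWatch billing alarm described above. Note that billing metrics are only published in us-east-1 and require "Receive Billing Alerts" to be enabled in the account's billing preferences; PagerDuty or plain email can then subscribe to the SNS topic.

    import boto3

    # Billing metrics only exist in us-east-1.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="estimated-charges-over-1000-usd",   # hypothetical name
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,              # the billing metric only updates a few times a day
        EvaluationPeriods=1,
        Threshold=1000.0,          # fire once month-to-date estimated charges pass $1,000
        ComparisonOperator="GreaterThanThreshold",
        # Hypothetical SNS topic; PagerDuty/email subscriptions hang off this.
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
    )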


Thank you for this. How can anyone run ANY service with ANY company and not add a cost-control clause in the contract (and then have the alerts up and running)?

I remember PagerDuty was advertising (a lot) on Leo Laporte's podcasts a few years back.

A clause in the contract: if the monthly bill reaches $Xk, then:

(a) seek written approval by client, and

(b) continue until $Yk or approval is given with a new ceiling price.


I was just playing around with AWS a while ago and was surprised that I could not find any option to put a cap on the amount I'd spend in a month. The only thing I could do was set up alerts.

I imagine AWS would have 0 problems suspending all my services if I can't pay, so why can't it do the same thing when it reaches my arbitrary cap?


> I'm never getting into another startup which has financial risk like that without being a core expert in that risk/tech

This may be something that goes 'unstated', but unless you actually have the access to fix what's wrong as well, being an expert in that wouldn't really help all that much. I've been in situations where I have explicit/expert knowledge of XYZ, but when the people responsible for XYZ don't take your input, and/or don't give you the ability to fix a problem, expert knowledge is useless (or worse, it's like having to watch a train wreck happen when you know you could have stopped it).


This. But on the other hand, you can be ready with the popcorn when shit eventually does hit the fan.

And then have to live with asking yourself "could I have done more?"

"...could I have saved the day if I were willing to loudly complain until someone listened?"

As in beer and crisps? /s

On the other hand, it sounds like you hired someone who wasn't really up for the level of responsibility given. :(

In theory ;), you shouldn't have to be a core expert in everything. But yeah... in the real world, things aren't so cut and dry. :/


TBH, the real problem is that AWS bills cannot be capped in any way (you can set up an alarm, though). It's unreasonable to expect that a programmer won't make mistakes.

Of course they can be capped, you just turn off the services. If you're asking them to automate that for you, then the counterpoint would be people accidentally setting a budget that wipes out their resources and complaining about that.

Easier for both sides to just ask AWS for a refund if there's a reasonable case.


> the counterpoint would be people accidentally setting a budget that wipes out their resources and complaining about that.

This wouldn't be an issue if it was configurable.


Mistakes will always be an issue. How you recover is more important.

Would you rather make a mistake leading to a big bill with the possibility of a refund or set your max budget and have your resources permanently deleted?


There would be no need to delete existing resources. Just prevent me from creating new ones until action is taken. For small projects in particular, I'd much rather have service taken offline and an email notification than even a $1000 bill. And $1000 is small in the scale of what you could end up with on AWS.

It's the existing resources that are a problem because most of them have a steady-state cost.

EC2 instances, EBS volumes, S3 data... should AWS delete those when you hit your budget? How do you stop the billing otherwise?


> How do you stop the billing otherwise?

With prioritisation, so the non-steady state services are stopped/killed with plenty of time to leave the needed foundations still running. :)


1) If you're AT the budget amount then everything must be deleted to avoid going over.

2) If it's a soft budget then it's no different than the alarms you already have.

3) If you want to stop it before it hits the budget, then you're asking for a forecasted model with a non-deterministic point in time where things will be shutdown.

This just leads to neverending complexity and AWS doesn't want this liability. That's why they provide billing alarms and APIs so you can control what you spend.


> 2) If it's a soft budget then it's no different than the alarms you already have.

Not if I'm busy, or away from work, or asleep. There is a massive difference between getting an alarm (which is probably delayed because AWS is so bad at reporting spent money) versus having low priority servers immediately cut.

Even without a priority system, shutting down all active servers would be a huge improvement over just a warning in many situations.


That's not a soft budget then, so which option is it? 1 or 3?

You want it to selectively turn off only EC2? Does it matter which instance and in which order? What if you're not running EC2 and it's other services? Is there a global priority list of all AWS services? Is it ranked by what's costing you the most? Do you want to maintain your own priority of services?

And what if the budget was a mistake and now you lost customers because your service went down? Do you still blame AWS for that? Or would you rather have the extra bill?

There is no easy solution.


It's really not that complicated. "Stop paying for everything except for persistent storage" is sufficient for the majority of use-cases where a soft cap would be appropriate. When you need to do anything fancier, you can just continue to use alarms as you do now. A tool does not have to solve every problem that might ever exist to be useful.

It's really not that complicated... to watch your own spend. And yet everyone here keeps running into issues, and that's just with your own projects. I'm sure you can at least appreciate the complexities involved at the scale of AWS, where even the minority use-cases matter.

"Everything except for persistent storage" is nowhere near useful enough to work and can cause catastrophic losses. Wipe local disks? What about bandwidth? Shut down CloudFront and Lambda? What about queues and SNS topics? What about costs that are inseparable from storage, like Kinesis, Redshift, and RDS? Delete all those too? And as I said before, what happens if you set a budget and AWS takes your service down, which affects your customers?

It's easy to say it's simple in an HN comment. It's entirely different when you need to implement it at massive scale and that's before even talking about legal and accounting issues. There's a reason why AWS doesn't offer it.


Just shut down everything, but don't delete existing data written to disks. That can cover a wide array of budget problems. If you set a budget like that you really do not want to go over it and any potential loss from customers is not as huge as going over that budget. At least have that option.

For example, I sometimes fiddle with Google APIs. I don't even have customers, so I don't really care if things stop working, but I have accidentally spent 100 euros or more. I have alerts, but those alerts arrived way too late.

I make a loop mistake in my code and now I suddenly owe 100 euros...


> "Just shut down everything, but don't delete existing data written to disks."

I literally just explained why this doesn't work with AWS services. You will have data loss.

And it creates a whole new class of mistakes. If people mistakenly overspend then they'll mistakenly delete their resources too. All these complaints that AWS should cover their billing will then be multiplied by complaints that AWS should recover their infrastructure. No cloud vendor wants that liability.


It's not an unreasonable use case to just nuke everything if your spend exceeds some level. (I'm just playing around and want to set some minimal budget.) But, yes, implement that and you will see a post on here at some point about how my startup had a usage spike/we made a simple mistake and AWS wiped out everything so we had to close up shop.

ADDED: A lot of people seem to think it's a simple matter of a spending limit. Which implies that a cloud provider can easily decide:

1.) How badly you care about not exceeding a spending threshold at all

2.) How much you care about persistent storage and services directly related to persistent storage

3.) What is reasonable from a user's perspective to simply shutdown on short notice


Don't let the perfect be the enemy of the good. In so many use cases, shutting off everything except storage would do a good job. And the cloud provider doesn't have to decide anything. It's a simple matter of setting a spending limit with specified semantics. A magic "do what I want" spending limit is not necessary.

> "shutting off everything except storage would do a good job"

Except it wouldn't. This is the 3rd time in this thread explaining that. Edge cases matter, especially when they lead to new mistakes like setting a budget and deleting data, or shutting off service when customers need it most.

If it's not a hard budget but a complex set of rules to disable services... then you already have that today. Use the alarms and APIs to turn off what you don't need.


Keeping those resources for a week but completely inaccessible would not be a huge cost for AWS yet a very big relief for startups.

And this happens every time you go over budget? So it's a constant monthly emergency credit? Or extended free tier? Is there a dollar cap on that? What happens if you go over that?

Not so simple.


> Of course they can be capped, you just turn off the services.

That's not a real cap, since turning off services isn't instant and costs continue to accrue. But, yes, there are ways to mitigate the risk of uncapped costs, and they are subject to automation.


See the sibling comment thread. It's just not that simple. It creates a lot of liability, could lead to permanent data loss, and doesn't really prevent any mistakes either (just swaps them for mistakes in budget caps).

AWS would rather lose some billings than deal with the fallout of losing data or critical service for customers (and in turn their customers).


it depends on the use case. For example, I would like to have developer accounts with a fixed budget that developers can use to experiment with AWS services, but there isn't a great way to enforce that budget in AWS. In this case I don't really care about data loss, since it's all ephemeral testing infrastructure.

In theory I could build something using budget alarms, APIs, and IAM permissions to make sure everything gets shut down if a developer exceeds their budget, but if I made a mistake it could end up being very expensive. Not that I don't trust developers at my company to use such an account responsibly, but it is very easy to accidentally spend a lot of money on AWS, especially if you aren't an expert in it.


So now we have another potential mistake - you setup a "delete everything/hard budget" for a production account instead of a developer account. What then?

It's impossible for AWS to know how to handle hard caps because there are too many ways to alter what's running and it's too contextual to your business at that moment. That's why they give you tools and calculators and pricing tables so that it's your responsibility (or a potential startup opportunity).

Money is easy to deal with. Alarms work. Bills can be negotiated. But you can't get back lost data, lost service, or lost customers.


There should be a cap so you have a check. If your system does not allow thresholds or assertions, please do not use it. If your cloud system does not have a capped budget you can play within, and does not alert you when you're about to run out, do not use it.

>In theory ;), you shouldn't have to be a core expert in everything. But yeah... in the real world, things aren't so cut and dry. :/

Right. In my experience, if you don't understand what's going on beneath your abstractions, you're always in for a world of hurt as soon as something goes sideways.


Did you reach out to AWS support or your account manager? They’d definitely have worked something out.

Did you contact AWS and let them know it was a mistake?

They have a good track record of cancelling huge bills the first time they happen


AWS should have a cost cap. Set a max spend value and shut down all servers if you spent it.

> AWS should have a cost cap. Set a max spend value and shut down all servers if you spent it.

That might make sense for some particular services (e.g., capping the cost of active EC2 instances), but lots of AWS costs are data storage costs, and you probably don't want all your data deleted because you ran too many EC2 instances and hit your budget cap.

What exactly you are willing to shut off to avoid excess spend, and what you don't want sacrificed automatically, varies from customer to customer, so there's no good one-size-fits-all automated solution.


I think if resources had an option of "At cap: Do nothing, Shut down, shutdown and erase data" that would cover most of the use cases.

Keeping the data for a week but completely inaccessible would not be a huge cost for AWS yet a big relief for startups.

Assuming you were incorporated and had a business account - declare bankruptcy and the bill goes away. I don’t understand why you would still pay the bill if you were going out of business anyway.

Why didn't I file bankruptcy? This happened in Australia and declaring bankruptcy was not the right thing to do - for many reasons, not the least of which is that it makes it much harder to operate as a director after a previously bankrupt company; and in the worst case my bank would have just gone after me, as I'd given a personal guarantee.

There is no concept of limited liability in Australia?

Even in the United States, most small business loans require personal guarantees which narrowly override the corporate limited liability to make that guarantor liable for that debt if the company doesn't pay. There are some rare exceptions, and possibly more for startups funded by big-name VCs, but I don't know.

But this isn't a small business loan: it's a debt to Amazon.

I read that as the business owner had a preexisting business loan with a personal guarantee.

Except the loan money will go straight to Amazon, and you are now unable to repay the loan to the bank

Where exactly does the bank enter the picture?

Scenario 1: Amazon asks for payment (if using a CC); the bank responds that there are no funds in the account; Amazon then deals with the company directly, not with the bank, eventually getting a payment order from the court. If the company went bankrupt meanwhile, Amazon might not get their money.

Scenario 2: Amazon sends the invoice; the invoice does not get paid. After the due date, Amazon contacts the company directly; the bank doesn't even enter the picture until a collection order comes from the court. If the company went bankrupt meanwhile, Amazon might not get their money.

There's no scenario where some hypothetical loan would go straight to Amazon, unless Amazon has some instrument that instructs the bank to pay them - something like a bank guarantee or promissory note - and uses it before the bankruptcy is declared.


I think they were referring to a scenario where Amazon is draining the funds that have already been loaned. Thus Amazon already has their money, and the bank is the one coming after you during bankruptcy.

Not sure how it works in OP's country, but where I live, when you get a loan, you will get a new account. As you draw the loan, you are getting into negative balance; how far you can go is the limit of your loan. As you pay back the principal, you are getting back to zero balance.

So for Amazon to drain the loaned money, you would have to transfer it to a normal account and pay with a debit card paired to that account, with no limit set.

It is not wise to transfer it to a normal account; you pay interest on the balance of the loan account, so if you move the money to your normal account, you are paying interest on money that is just sitting there.


Wouldn't Amazon be draining a credit card directly? Tied to the account you received the loan on?

If they used a CC (not debit), then any payment would mean creating a debt, so yes, they would have to pay the bank, because the bank already paid on their behalf.

That's why you don't pay large sums with CC, but with invoice + bank transfer, and have a limit set on your cards, when you do.


Can you explain that more clearly? What is the reason to not pay large sums with a credit card?

Several factors:

- control: you are in control, when you do the payment. You can plan your cash flow.

- additional advantages: You also have payment terms, and some vendors offer discounts for earlier payments; if your cash flow can handle that, why would you give that up?

- liability: with a CC, you are getting credit that is drawn at the other party's leisure. It's you who is liable for this credit line, even if the other party made a mistake. You are always liable to the bank, never towards the vendors. With bank transfers, every single payment was authorized by you (where by 'you' I mean an authorized person at your company) and the liability is towards the vendor, who is not likely to have such a strong position (see Porter's five forces).

- leverage: if another party makes a mistake, they have motivation to correct it. Every company in existence has already received invoices that are incorrect. Withholding payment until they are corrected is a strong motivator. Without that, you could be left without invoices that can be put into accounting AND without the money that you have to account for.

- setting up processes: when you grow beyond a certain size, you are going to want to formalize procurement, accounts payable, and treasury. Having purchasing and payment discipline that is compatible with that already in place will mean less pain from the growth, and fewer things to change.

When we need people in the field purchasing small supplies, we don't want them to handle cash, so they get debit (not credit) cards, with relatively small limits. It is enough for them to get by, but not enough to make any damage of significance. (The exception is fuel and that's what fuel cards are for - basically it has a form factor of a credit or debit card, but works only for fuel, is paired to a license plate and the vendor sends invoice at the end of the month).

Another scenario where CCs are useful is if you need to pay for something right now; you can't or don't want to wait for the order->delivery+invoice->payment cycle. That's fine for consumer impulse purchases, but it should not be a normal way for company purchases.

Of course, if you start a new business relationship, some companies won't trust you to pay the invoice; sending an advance invoice and paying it is fine. In practice, it is quite a rare occurrence.


Depends where Amazon ranks in seniority in bankruptcy (protection). You don't have to run out of money to file for it. Purdue Pharma sure didn't.

I’ve worked in many early startups and I’ve never seen anyone use such a loan.

Were they in the US and funded by VCs? That kind of startup probably doesn't need to do this. Unsure about VC-funded businesses elsewhere. Many or even most small businesses without VC funding do take that kind of loan.

You work at the 1%

The real world is filled with barbershops, daycares, bars, clinics, PVC manufacturers etc

None of them get VC money.

When they need money, they go to a bank and usually have to place a PG in order to get funds.

Tech startups have it easy. It's all equity. You are not pledging your lifetime earnings on a business idea.

Once tech startups lose their upside potential (prob not anytime soon if ever), you will be sitting with the regular folk, those that pledge their skin and life to their business.


If a director becomes personally bankrupt (such as trying to be the good guy and using personal guarantees to take on company debts in an effort to scrape through) then they're banned from running a company until it clears. If they're the director of a company that goes bankrupt, I believe they get 2 chances (companies) before there's a chance of being banned from running more for a time.

Either way it might be nice to keep your options open, depending on your plans.


Or you could just send an email to support and ask them to waive the charges.

If that got to the right person on the right day and they knew it was going to kill the company, it seems likely to help. And combined with the fact that it would probably guarantee future revenue way off into the future...

I have never heard of a case where they wouldn’t give refunds. AWS is competing with the 95% of compute that is not running in the cloud (their own statistics). The last thing they want is a reputation that one mistake will bankrupt a business.

We had spot instances with a mistakenly high bid that incurred thousands overnight when the prices spiked. No refund offered.

I know several other companies that had expensive mistakes without refunds. There's probably a complex decision tree for these issues and I doubt anyone really knows outside of AWS.


> I have never heard of a case where they wouldn’t give refunds.

Really? Working in Southern California a few years ago, refund requests were refused ALL THE TIME. This is why there's a common belief that what you are charged you simply owe them, period.

It may be more progressive now, but let's not be revisionist.


Once I got something like a year of EC2 charges retroactively reimbursed for a few instances I hadn't used.

I've repeatedly seen requests of this nature handled by AWS - 75% cuts to billing, 90% cuts even.

This. I work at Amazon and this is more common than you'd expect. "Customer obsession" and all that.

I'm not the type to 'want to speak to the manager' for my self-imposed problems but the more I hear about people coming out ahead the more I think I need to change my ways.

I think you have to think of it a bit more from Amazon's perspective. If you accidentally burn through your entire startup capital and shut down, they lose. If the risk of this sort of thing becomes well-known, then startups will start using other services rather than AWS, and the small fraction that grow big will be less likely to use AWS.

Being an entitled jerk who blames other people for your own negligence is bad, and you shouldn't change that. But openly giving companies the opportunity to be kind (while admitting that it was entirely your fault) potentially helps both them and you.


Yep, and an opportunity to educate on things like budgets and billing alarms to try to prevent this in the future.

Yeah, every time I’ve heard this story support have always fixed it, at least the first time per account

We used to have a bunch of billing graphs in stack driver with alerting thresholds to pagerduty to capture exactly situations like this.

Why is there no way to set a limit on billing on AWS? Especially for cases like this, where killing testing instances does not have a dramatic negative effect...

Agreed. The simple solution is an expenditure cap. Why can't Amazon implement one? The fear of it going wrong like this would make me keep away from AWS forever.

Wait, is there really not one on AWS? I thought this was the #1 most important feature on any such cloud systems.

It's the very very first thing I set when setting up my GCloud hobby project. I was like, this is fun and all, but I don't care about this enough, so I limited it to $3 per day and $50 per month. If it goes above that, I'm very happy to let it die, and it also gives me a warning so I know something is up. The 2 times it triggered, there was something I managed to fix, so the tool is still up and running, costing pennies.


I got pegged to the wall by AWS once on a hobby project. $1500 racked up in two months. Apparently I left a snapshot in some kind of instant-restore state to the tune of $0.75/hr. I used the instance for 2 days, and then shut everything down. Or at least thought I did.

The account I did it on was tied to my "junk" email, so I didn't catch Amazon banging on my door saying my payment info needed to be updated. Well, until I did happen upon one of the emails. Nearly had a heart attack.

Talked to AWS support and they fully refunded me. Very very kind of them, but now I'm terrified to touch anything AWS.


I don't think an expenditure cap is so simple. Exactly what happens when you hit it? If you have, let's say, 3 RDS DBs and 20 EC2 instances running and a bunch of stuff in S3 and a few dozen SQS queues and a few DynamoDB tables etc, and your account goes over the limit, how do you decide which service you want to automatically cut?

So 90% of the time I hear these horror stories it's a test/dev account where deleting everything is preferable to getting a bill.

I also don't understand why everyone is assuming

"if I hit threshold X do A, if I hit threshold Y do B" where A and B are some combination of shutting down and deleting resources,

is as difficult as solving an NP-complete problem.


> Why can't Amazon implement one?

Greed, I'm assuming.


Nowadays quotas give you some safety net. For example, you usually have to request a limit increase to get more than one GPU (which avoids burning money that way), or more than, say, 32 instances. It should not be possible for a new account to spawn 1k VMs overnight.

The problem with billing is that often these charges are not calculated instantly, and others are not trivial to deal with. For example what happens if you go over budget on bandwidth or bucket storage, but still within quota? What do you kill? Do you immediately shut down everything? Do you lose data? There are lots of edge cases.

You can normally write your own hooks to monitor billing alerts and take action appropriately.


There are service limits on new accounts per region - 20 EC2 instances. These require a support ticket to override.

You can still burn an awful lot of money with 20 EC2 instances.

... put a credit card on the account that only has a $1000 limit. Or better yet, a prepaid one.

In this case wouldn't it just cause Amazon to send you a notice that the $10k overnight charge was declined and you should enter another payment method?

How many is a shitload of instances? Are we talking tens, hundreds, thousands?

In my experience AWS has very stringent limits on the number of active instances of each type (starting around 10 for new accounts, 2 for the more expensive instance types). It takes tickets to support and then days of waiting to raise these limits.

That should have prevented your company from creating tens of instances, let alone hundreds, unless that's already your typical daily usage.


There used to be no limit on EC2 instances.

Holy crap dude, that's some nightmare shit right there.

Does AWS update the billing console per day or upon request? I get charged per month, but I should add a habit in my habit tracker to learn more about my expenses...


Hourly. You can also set up billing alerts, which will email you.

Be aware that some services bill asynchronously so it can take 24 hours in some instances.

This is what was needed.

What's the technical process to ensure that this never happens? Nowadays, having to have someone "watch" the test and then kill the instances is manual labor which is a no-no. So how do you make it so that your test fires up the instances, and then kills them when the test is done.

I think you have to have an upper bound set with AWS that kills stuff when you have reached the amount of money you want to spend. But of course, people would whine about that. "How AWS killed my business on the busiest day of the year," would probably be the article title.

But I have far more sympathy for "I made an AWS mistake and got hit with a $100k bill" than for "I told AWS to turn off my EC2 instances at $10k, and then at $10k it turned off my EC2 instances".

There are many ways to solve this problem. One way to do this is to model your test infrastructure in CloudFormation. You can then use an SSM Automation Document to manage the lifecycle of your test. Putting all your infrastructure in CloudFormation allows you to cleanup all of the test resources in single DeleteStack API call, and the SSM Document provides: (1) configurable timeout and cleanup action when the test is done, (2) auditing of actions taken, and (3) repeatability of testing.
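
As a rough illustration of that pattern (just a sketch: the stack name, template file, and run_test() driver are hypothetical), the whole lifecycle can be wrapped so that DeleteStack always runs, even when the test blows up:

    import boto3

    cfn = boto3.client("cloudformation")
    stack = "render-load-test"       # hypothetical stack name

    def run_test():
        pass                         # placeholder for the actual test driver

    try:
        with open("test-stack.yaml") as f:
            cfn.create_stack(StackName=stack, TemplateBody=f.read())
        cfn.get_waiter("stack_create_complete").wait(StackName=stack)
        run_test()
    finally:
        # Tear down every resource the test created, pass or fail.
        cfn.delete_stack(StackName=stack)
        cfn.get_waiter("stack_delete_complete").wait(StackName=stack)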

Not sure if this would help in this particular scenario, but unit and integration testing of operations scripts can save a lot of pain, anguish and $$s too.

It's horrifying how many places treat writing tests for services as critical, but then completely fail to write tests for their operational tooling. Including tools responsible for scaling up and down infrastructure, deleting objects etc.


But if a test fails does it now mean you're bankrupt?

Could do? Not sure what your point is here.

You can do timed instances, and/or give the instances a timed job to shut down after a fixed time (which is what I use to shut down an instance that only gets spooled up for occasional CI jobs, after an hour).

+1. When I had to use AWS for batch workloads, which at the time at least didn't have a TTL attribute on VMs, I made sure that the VM first scheduled a shutdown in like 30 min if the test was supposed to only run in 10 min.
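
A minimal sketch of that belt-and-braces approach (the AMI ID and instance type here are hypothetical): the instance schedules its own shutdown at boot, and with InstanceInitiatedShutdownBehavior set to 'terminate', that OS shutdown also terminates the instance, so billing stops even if the test driver dies and never cleans up.

    import boto3

    ec2 = boto3.client("ec2")

    # Schedule an OS-level halt 30 minutes after boot as a dead-man's switch.
    user_data = "#!/bin/bash\nshutdown -h +30\n"

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical AMI
        InstanceType="c5.xlarge",          # hypothetical instance type
        MinCount=1,
        MaxCount=1,
        # OS shutdown terminates (not just stops) the instance.
        InstanceInitiatedShutdownBehavior="terminate",
        UserData=user_data,
    )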

You can use auto scaling groups with a load balancer to terminate instances when not in use and spin them up as required.

This is why it's Terraform or nothing for me.

I'd be fascinated to hear how Terraform would have intelligently known that those instances were not meant to stay on overnight.

Honestly, I'd create the instances using an ASG, then set the ASG size to 0 (or throw inside a while loop until any errors go away). Always create instances from an AMI and always put them in an ASG (even if the ASG only has 1 item min, target, and max on it).
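
Something along these lines (a sketch with a hypothetical group name): one call drops the whole fleet to zero, and the ASG terminates every instance it owns, so nothing is left running overnight.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Scale the test fleet to zero; the ASG terminates its instances.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="render-test-asg",   # hypothetical group name
        MinSize=0,
        MaxSize=0,
        DesiredCapacity=0,
    )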

I love Terraform and ASGs but that still doesn't solve the fact that their SRE overprovisioned. They might have even used both things!

This has happened to me several times, albeit at a much smaller scale. I fire up a few GPU instances for training neural networks, and when I go to shut the instances down I forget that you always need to refresh the instance page before telling AWS to stop them. I still go through all the confirmations saying I do, indeed, want to stop all instances. However, these few times I forgot to refresh to make sure they actually were shutting down and simply went to bed. Not an $80k mistake, but certainly a couple hundred dollars, which hurts as a grad student.

Now I have learned, _always_ refresh the page and instance list prior to shutting anything down and _always_ confirm the shutdown was successful.
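
The API version of that habit might look like this (a sketch, not a drop-in tool): ask EC2 what is actually still running, stop it, and block on the waiter until everything really reports stopped.

    import boto3

    ec2 = boto3.client("ec2")

    # Don't trust a stale console page: ask the API what is still running.
    resp = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    ids = [i["InstanceId"]
           for r in resp["Reservations"]
           for i in r["Instances"]]

    if ids:
        ec2.stop_instances(InstanceIds=ids)
        # Block until EC2 confirms every instance has actually stopped.
        ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)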


Not who you asked, but my mistake was transferring an S3 bucket full of unused old customer web assets to glacier, we were paying a lot to host them each month, and weren't using them anymore.

I set the lifecycle rule on all objects in the bucket, for as soon as possible (24 hours).

About 2 days later first thing in the morning I get a bunch of frantic messages from my manager that whatever script I was running, please stop it, before I'd even done anything for the day.

The lifecycle rule had taken effect near the end of the previous day, and he was just getting all the billing alerts from overnight, it was all done.

I read about Glacier pricing, but didn't realize there was a lifecycle transition fee per 1,000 objects (I forget the exact price, maybe $0.05 per 1,000 objects). That section was a lot further down the pricing page.

The bucket contained over 700 million small files.

I'd just blown $42,000.

That was over a month's AWS budget for us; in the end, AWS gave us 10% back.

On the plus side, I didn't get in too much trouble, and given we'd break even in 4 years on S3 costs, upper management was gracious enough to see it as an unplanned investment.

TLDR: My company spent 42k for me to learn to read to the bottom of every AWS pricing page.
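
Back-of-the-envelope, assuming a transition fee of roughly $0.06 per 1,000 objects (the exact rate above is admittedly fuzzy), the bill checks out:

    700,000,000 objects / 1,000 * ~$0.06 per 1,000 transitions ≈ $42,000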


What would have been the correct solution here? Group them into compressed archives first to reduce file count?

One .zip to rule them all :)

Haha, I originally wrote "one giant zip file?" but I decided to rephrase it as a more serious answer.

Why would they create a pricing structure like that instead of basing it on total size?

Using a post paid service and getting bill shock.

Not that it killed us or anything, but we hired a Director of DevOps at my company who we tasked with the simple job of setting up a dev server for a Java REST server that would have like 6 concurrent users. It needed a cache, but no persistent database. A task beneath a director and one that the dev team would usually just do themselves, but he was here to show how to DevOps the right way and not be so ad hoc. He somehow managed to set this up to cost like $8000/mo after we have conservatively budgeted for $50. He was fired for myriad reasons and we spent like a week trying to figure out what he had done.

May I ask what kind of background he had? Was it a hiring bet/mistake or was he fine on paper (and probably claiming way too much)?

From a favorite HN comment:

When there is a lot of money involved, people self-select into your company who view their jobs as basically to extract as much money as possible. This is especially true at the higher rungs. VP of marketing? Nope, professional money extractor. VP of engineering? Nope, professional money extractor too. You might think -- don't hire them. You can't! It doesn't matter how good the founders are, these people have spent their entire lifetimes perfecting their veneer. At that level they're the best in the world at it. Doesn't matter how good the founders are, they'll self select some of these people who will slip past their psychology. You might think -- fire them. Not so easy! They're good at embedding themselves into the org, they're good at slipping past the founders's radars, and they're high up so half their job is recruiting. They'll have dozens of cronies running around your company within a month or two.

https://news.ycombinator.com/item?id=18003253

I'm guessing something like the dynamic described here was involved.

The silver lining here may be that he outed himself (literally) before he was able to build an empire of such incompetence.


That's not really it. Our company is small enough that I can talk one-on-one with the head of the tech department and I did give direct feedback about this person. That head of tech was responsible for the mishire, but also got rid of this person pretty quickly once all the feedback accumulated.

My company is service-based and just over 1000 people. Timesheets equal billable hours. It's occasionally very pressurized and we lose people pretty quickly when there's a lull in work, but it also means that useless people have absolutely nowhere to hide.


It sounds like your boss is making these decisions on his own without soliciting additional perspectives and feedback in advance as part of the hiring process. If so, that is a common pattern that, in my experience, leads exactly to these kinds of hires.

But with a fire-fast approach, it sounds like your company can move fast on hires and be ready to contain the damage.


My personal take on it is that a situation like that can be prevented from getting out of hand. But that requires a great deal of courage, often putting the entire business at risk. As a founder you will even come across as a mean guy if you take on the task of enforcing integrity. Judging the integrity of people often means asking very hard, probing, personal questions, which I suspect is difficult for most founders.

My own thoughts about this:

https://realminority.wordpress.com/

Disclaimer: Not a founder myself, but have observed one at close range.


If you hire people, you could ask for or collect other kinds of feedback on how your hire has performed (from someone other than the hire themselves, of course).

I'd counter this: I've never had good feedback, because of people who wanted a solution, just not from me - even when the solution I'd bring would cost less over time.

I have been bitten by colleagues and it still hurts, because they weren't that great with IT.

I'd rather show off what I can do and what I need to work on than rely on somebody else. (Again, I have been bitten by that.)


Mishire. I don't want to doxx anyone, but the tech team realized pretty quickly that he was more of a technical manager and not a real engineer. He had a serious neckbeard mentality about being right about everything yet couldn't write a Hello, World on his own. He did little to win people over and got caught reusing work he'd taken from his team at his last job.

I see, thanks for the details.

You know, it happens, to everyone, however good or experienced; what matters for a company's (and individual) sake is how we respond to mistakes.

You guys responded well, that was resilient. The next step would maybe be antifragility. Did something change afterwards, because of this bad experience?


I know most of AWS's base services, but it would take real work for me to spend $8000/month on a simple three-tier website.

Please share more; these types of stories scratch an itch like no other.

Did you ever figure out what he did?

We only identified two things that were unusual. For one, he used RHEL instances instead of CentOS or Ubuntu, and the other was that he allocated a load of EBS capacity with provisioned IOPS. Idk if it's even possible to get a complete history - like if he had done other stuff that he had already undone before we looked.

AWS gives you the tools you need to answer this question. CloudTrail logs every API action (there may be some esoteric corner cases - I think some AWS services have launched features and then weeks later announced "oh, those API calls are now recorded in CloudTrail", that kind of thing - but by and large it's good enough).

You should have a "global" cloudtrail turned on in all your aws accounts, with the integrity checksumming turned on, either feeding directly to an s3 bucket in yet another account that you don't give anybody access to or at least feeding to a bucket that has replication set up to a bucket in another locked-down account.

The CloudWatch Events console can find some CloudTrail events for you, but you might have to set up Athena or something to dig through every event.
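
For recent activity, a quick sketch along these lines can answer "who created what, and when" (the event name is just an example; lookup_events only covers roughly the last 90 days of management events, so older history needs the S3 trail plus Athena):

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # Example: find recent EBS volume creations and who made them.
    events = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName",
                           "AttributeValue": "CreateVolume"}],
        MaxResults=50,
    )
    for e in events["Events"]:
        print(e["EventTime"], e.get("Username"), e["EventName"])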


We didn't have enough expertise to do all that nor did we own the billing info. We also didn't spend too much time because it was moot. We shut down everything we could see and ate the bill.

They should give you the option to set a hard limit across your entire account, to prevent you from accidentally spending more money than you have. "If I try to spend more than $5k in a month, something has gone wrong, don't let me do that."

Seems like circuit breakers should be a standard safety feature for automatically infinitely scaling computers.

I would rather my whole system shut down and be unusable while I investigate vs. auto-scale and charge me a bill I can't cover.

However, searching around it seems like I can only get alerts when a $$$ threshold is passed, but AWS won't take any action to stop computing or anything. Please prove me wrong.


>I would rather my whole system shut down and be unusable while I investigate vs. auto-scale and charge me a bill I can't cover.

The counterargument is that you get a usage spike (which is often a good thing for a company), and AWS shuts down everything connected to your AWS account without warning.

I'm not necessarily sure that optional/non-default hard circuit breakers would be a bad thing. But it certainly appears not to be a heavily demanded customer feature and, honestly, if it's not the default - which it shouldn't be - I wonder how many customers, or at least customers the cloud providers really care about, would use them.


The usage spike is very very rarely worth the cost. That’s a pipe dream the cloud providers sell to cover up the fact that these scenarios are sweet sweet profit for them and nothing more. There are very few businesses where making more money is just a matter of throwing some more compute at it.

Nearly every customer (i.e. all of them with a budget) would make use of circuit breakers and it would make Amazon absolutely $0 while costing them untold amounts. Are you really surprised Amazon hasn’t implemented them?


I imagine if usage spikes are not valuable and uncommon then static resources could be less expensive to provision, right?

For example, Vultr can give you a "bare metal" 8vCPU/32GB box for $120 a month (not sure if this is contract or on-demand) vs Amazon's m5.2xlarge for $205 reserved. $85 might not sound like much, but that's about 70% more. Who would love to save ~42% on their cloud costs?


> Are you really surprised Amazon hasn’t implemented them?

It becomes harmful to them at a certain point, though. People feel the hit and avoid the service. Having people spend a little more accidentally and go 'oh well, oops' is the sweet spot. An unexpected $80k which kills the company is bad for everyone.


This almost feels like banking fees. A dollar here, a dollar there. In this case it’s a couple of thousand here and there until you can’t afford it anymore lol.

Not really. Everyone thinks it can't happen to them. Certainly me too, and I've been using AWS since it first launched, in some capacity or other.

How much more does it cost AWS to allow you to spin up resources and then liberally offer refunds when you contact them and tell them you made a mistake?

Not much. Some electricity. Servers are there either way.

Exactly. They decided it was cheaper for them to let you make mistakes and then grant refunds.

If enough people demand them it could become a competitive advantage in the already-cutthroat cloud hosting market

You'd factor that in to the ceiling you set. Maybe your ceiling is 2x or 3x your expected usage. That could still be low enough not to bankrupt your company.

Most cellphone providers provide you with a text message when you're over 90% of your hard cap, and you can login and buy more bandwidth if you really need it.

The same could be done with cloud doo-dads.


You can absolutely set this up in AWS and GCP.

Amazon isn't going to put much effort into automating reminders for you to keep your bill low.

Actually, Budgets allow you to feed an SNS topic that will dispatch reminders at any point (10%, 50%, 100%) of a total spend amount.

Sadly, some services take as long as 24 hours to report billing.
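
As a rough sketch of what that looks like with the Budgets API (the account ID, budget amount, and SNS topic ARN are all hypothetical placeholders):

    import boto3

    budgets = boto3.client("budgets")

    budgets.create_budget(
        AccountId="123456789012",              # hypothetical account id
        Budget={
            "BudgetName": "monthly-cap",
            "BudgetLimit": {"Amount": "500", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        # Ping an SNS topic at 10%, 50%, and 100% of actual spend.
        NotificationsWithSubscribers=[
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": pct,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{
                    "SubscriptionType": "SNS",
                    "Address": "arn:aws:sns:us-east-1:123456789012:budget-alerts",
                }],
            }
            for pct in (10.0, 50.0, 100.0)
        ],
    )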


>I would rather my whole system shut down and be unusable while I investigate vs. auto-scale and charge me a bill I can't cover.

Sure, but most large companies (the kind AWS gets a lot more revenue from and cares about a lot more) want the exact opposite. Most large companies have the extra cash to spend in the case that a runaway auto-scale was in error; on the other hand, completely shutting down operations whenever a traffic spike happens could result in millions of lost revenue.

>However, searching around it seems like I can only get alerts when a $$$ threshold is passed, but AWS won't take any action to stop computing or anything. Please prove me wrong.

The general advice is to use the various kinds of usage alerts (billing alerts, forecasts, etc) to trigger Lambda functions that shut down your instances as you desire. It takes a little configuration on your part, but again, AWS intentionally errs on the side of not automatically turning off your instances unless you specifically tell it to.
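
A minimal sketch of such a Lambda circuit breaker (triggered, say, by an SNS subscription on a billing alarm; everything here is illustrative rather than a hardened tool). It stops rather than terminates instances, so the EBS volumes survive:

    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        # Find everything still running in this region...
        resp = ec2.describe_instances(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        ids = [i["InstanceId"]
               for r in resp["Reservations"]
               for i in r["Instances"]]
        # ...and stop it. Stopping halts compute charges but keeps the
        # root EBS volumes, so the circuit breaker deletes nothing.
        if ids:
            ec2.stop_instances(InstanceIds=ids)
        return {"stopped": ids}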


> Sure, but most large companies (the kind which AWS gets a lot more revenue from and cares about a lot more) want the exact opposite.

It does not have to be all or nothing. You could, for example, set up a separate account per department and/or purpose and impose a hard cap on spending for experimentation, but not on production.


If the customer dies, the lifetime value of the customer is almost-certainly lower.

Great companies find ways to help their customers thrive.


I don't know what Amazon actually does, but what I think of as normal is the customer calls the help desk and they reverse the charge. This seems simpler than worrying about how to code algorithms that will deal with all possibilities.

People make mistakes with transferring money in the millions of dollars all the time, and it's not uncommon for people to be just like "oops, back that out". It's obviously going to be in the news when that doesn't happen though.


There are circuit breakers. AWS has plenty of “soft limits” that you have to request to increase including the number of EC2 instances you can spin up.

And yet, the simplest circuit breaker, total spend, is well hidden if it exists at all, judging from all the horror stories.


Those are alerts though, not circuit breakers. (Although alerts can be used to trigger certain types of circuit breakers.)

What some people are asking for--and it's a reasonable use case but one that AWS, somewhat understandably, isn't really focused on--is: "Burn it all down if you have to, including deleting databases, but no way no how charge me a penny more than $100/month (or whatever my set limit is)."


The circuit breaker everyone would want is probably: "Drop all compute with its ephemeral storage when the cost of every resource already used, plus the projected cost of my stable storage (EBS+S3+Glacier+Aurora+whatnot), is greater than my preset $MONTHLY_MAXIMUM."

That means any data getting lost as a result of that limit is data that they weren't guaranteeing in the first place. You might not be able to actually read your EBS volumes, S3 buckets or Aurora tables without increasing the spending limit or otherwise committing more funds, but it won't go away that second, and you would have enough time to fix it (worst case - wait until next month; you did budget that already).

Alternatively: assign each resource to a pool, and monthly spending limits to each pool. Give your EBS/S3 $1000/month, and your R&D-pet-project-that-may-accidentally-spawn-a-billion-machines $50/month.


Projected cost of persistent storage is still tricky. But, yeah, something along the lines of "Turn off ongoing activity and you have to true up persistent storage soon or it goes away." I don't think one would actually implement a system where a circuit breaker immediately and irrevocably deletes all persistent storage.

And, as you say, it makes sense to have a pool--whether it's your whole AWS account or not--where you can turn on the "burn it all down" option if you want to.


they already are.... but you also need to put a bit more effort into it.

For example, you can easily shut down the ability to launch anything in a particular region - but if you specifically want to exceed a default limit, you can speak with your rep and have any of your limits set to whatever you want.


The argument I've heard against this is that for certain types of spending, cutting things off causes data loss. Automatically doing that is also a foot gun.

The proposed solution is usually to setup billing alerts so you can detect the issue early and fix the problem in a way that makes sense.

I'd suggest further: new AWS (azure, etc) accounts should have billing alarms at $0, $10, $100, $1000, etc. by default on account creation. Users can delete if they don't want them or want something different. Getting an alert at $100 as it happens instead of getting a >$1k bill at the end of the month is a much better customer experience.


They could include it as an option with a big red warning advising against it.

Depending on what you're doing, data loss might not be nearly as big of a threat as a massive bill.

I could also imagine it being configurable on a service by service basis to mitigate against the data loss downside - e.g. maybe you have a hard cap on your lambdas but not your database snapshots.


The solution is "you do you". Allow cloud customers to choose the behavior: whether they're good with just alerting, or they want a circuit breaker to act, or both. I am sure there is an upper bound of spending where, if you were asleep and not reacting to notifications, you'd be better off having the system die instead of killing your company or making you bankrupt.

Isn't that where we are now? Choose your desired behavior seems equivalent to "here's billing alerts and AWS API, good luck".

Someone could build a cost circuit-breaker Lambda function fairly easily. Wire a billing alert to the Lambda, use the AWS API to terminate all EC2 instances (or other resources). Someone could open source that fairly easily.

I think it's about reasonable defaults. You can recommend that customers configure their accounts with the behaviors they choose, but clearly many aren't using the tools they currently have. So we have these horror stories.

I will mention that I don't think deleting resources is a good default behavior as that can cause data loss (itself a potentially business ending event). But people are certainly able to evaluate their own risk and can implement that with the tools that exist today.


I think a better way to look at this is how much money banks make on overdraft fees (even very small overdrafts). It's just such an easy way to make money.

FYI, all limits are malleable - and you CAN request limit increases/decreases as you desire - you just typically can't do it in the console - but you can speak to your rep and have it done.

Too bad you can't segment it to have a hard limit on the non-essential stuff. So at least you can serve the webpage saying what you are or something.

I was typing the exact same solution when I saw your response. Completely agree that this should exist.

So you set a hard limit, then suddenly storage doesn't work anymore, queries don't get routed, etc... really bad and weird behaviour ensues.

"We went bankrupt overnight due to a runaway script" is "really bad and weird behavior"...

There is no such thing as "xyz went bankrupt because of cloud over-provisioning". First of all, we don't have the whole story from OP and I suspect he's not telling everything; second, AWS will cover those accidents; third, by default you can't create many resources on an account with default quotas.

AWS may cover those accidents, and you can get the default quotas raised quite a bit. I agree that some info is probably missing here, but it's not entirely implausible.

Sure, but you'd imagine basic alerting would tell you the circuit breaker is in action right now. Depending on your application, bad behavior is better than a dead company.

Azure has this, budgets.


AWS is pretty flexible on first-time mistakes in a particular account; I've personal knowledge of several $5-10k "whoops" bills being nulled out.

There may be a maximum threshold for this, though.


Can attest, once racked up $10k AWS bill due to a silly mistake, got it nulled. That was a great lesson about how fast things can go wrong with pay-as-you-go pricing if not monitored.

Can you share some details?

It's embarrassing to write that now, but I accidentally left the private key for an EC2 instance publicly available on GitHub. And I think what happened is that a bot scraped that key and used my resources to mine Bitcoin.

This seems quite common. I have heard several stories to this effect. Faulty firewall settings or keys committed to the repo seem to be the common two.

I can confirm. I worked for AWS and we always waived all fees when people made a mistake like this. Many thousands of dollars.

We also asked them to summarize what happened so we could think about how to help other users not make the same mistake in the future.


I wish people would stop saying people "always" have fees waived. It's absolutely not true.

I had a personal project, where I wanted to occasionally do short but highly parallel jobs. Once my scripts didn't close everything down correctly and a week later I had spent £600. That's a lot of money to me personally. I asked politely and it was never refunded.


You are right. We were responsible only for one service out of hundreds. Different teams have probably different rules.

Counter-experience: I worked for a 7-person company where the AWS admin accidentally spent $80k on some VMs to run a job over a long weekend, because he misread the pricing per machine-hour, and it didn't warn him how much he was spending when he turned them on. Yes, he could have set up alerts, but people fuck up all the time. Our daily usage went from $300 to $27,000 overnight. We spent over a year trying to convince AWS to forgive part of the bill. They did not. We went on a long-term payment plan. It sucked a lot.

So we switched to Google Cloud, which has a better UI for telling you how much you're about to spend. As we grew, we ended up spending way more money on GCP than we ever did on AWS.


Please do share what it was and why it happened, you can't leave us like that :) !



I have been on the saving end of some of these AWS mistakes (billing/usage alerts are important, people), but alerts aren't always enough to keep them from happening entirely if they unfold fast:

- Not stopping AWS Glue with an insane amount of DPUs attached when not using it. Quote "I don't know I just attached as many as possible so it would go faster when I needed".

- Bad queueing of jobs that got deployed to production before the end of month reports came out. Ticket quote "sometimes jobs freeze and stall, can kick up 2 retry instances then fails", not a big problem in the middle of the month when there was only a job once a week. End of the month comes along and 100+ users auto upload data to be processed, spinning up 100+ c5.12xlarge instances which should finish in ~2 mins but hang overnight and spin up 2 retry instances each

- Bad queueing of data transfer (I am sensing a queueing problem) that led to high db usage so autoscaling r5.24xlarge (one big db for everything) to the limit of 40 instances


WTF?

I had a new employee dev include AWS creds in his github which was pulled instantly by hacker bots that launched a SHITTON of instances globally and cost $100K in a matter of hours...

It took my team like 6 hours to get it all under control... but AWS dropped all charges and we didn't have to pay any hosting costs for it.

So why didn't you work with AWS to kill such charges?


I mostly have an Azure background, but we absorbed a country that ran on AWS. I did an audit of all of their infrastructure this fall and found they were overspending on AWS by 90%! I wish I had been tasked with it sooner; it could have saved the company hundreds of thousands, which is a few new hires. I was shocked at how mismanaged it was. It seems like the person that set it up was not familiar with cloud pricing models. For example, there were 100 or so detached disks that hadn't been touched in a year or two, which of course isn't free. The instances were too big, etc. I've always found Azure's billing to be easier to maintain cost controls over (at least it feels more friendly). I wonder how many companies unintentionally overspend on cloud services because of a lack of understanding of the pricing models.

You absorbed a country?!

Mistakes happen. I've never heard of a case where an honest mistake like that was made and, with a simple email, AWS wouldn't waive the charges.

Might be true with AWS, but I had to go back and forth with GCP for days and threaten them with a blog post proving that their budget alert arrived 6 hours late (with our response in <15 minutes) in order to get refunded. Still pissed.

There are reasons that enterprises don't trust GCP. Google doesn't exactly have a sterling reputation for customer support. Anyone who stakes their business on any Google service should know what risks they're taking.

If their reputation isn’t enough to keep you from depending on GCP, their hiring of a lot of people from Oracle to beef up their sales staff should be a major warning.


Not only that, remember those mysterious "Google layoffs" back around Valentine's Day, before this virus mess? The ones they were so secretive about?

That was one of the new Oracle execs firing the entire Seattle-based cloud marketing team.

The reason for firing those people was solely to open up headcount to get more salespeople. The marketing team was not happy, for obvious reasons, and company PR worked pretty hard to spin this one. This is not how Googlers expect the company to work. But it is exactly what Oracle refugees expect.

My take: GOOG is in for some Ballmer doldrum years of its own. They've well and truly arrived for the employees, but Wall Street hasn't quite figured it out yet.


And guess what the first thing to get cut is during a recession - advertising. Not to mention if VC funding dries up not only does it affect advertising budgets, it’s mostly startups that are crazy enough to go with GCP. Most major enterprises who are going to start a major migration would go to AWS or Azure.

This might be a YMMV thing. I (well, someone in my team, but that's still on me right?) accidentally burned through ~£150,000 in 3-4 days on GCP, and GCP Support was quite straightforward (and indeed quite helpful through my extreme distress at the time) in refunding me the charges.

same here: AWS refunded extra charges for unintentional mistakes, GCP went ahead and charged me without bothering to listen.

Are there any major cloud hosting/computing providers which do provide a hard spend limit?

We've heard the horror stories from the Google data egress pricing "surprises" (like the one that GPT adventure game guy incurred a few months ago: https://news.ycombinator.com/item?id=21739879).

We've heard the AWS and Azure horror stories.

It seems crazy that the only hope of correcting a mistaken overspend is a helpful support desk. The first one is free, right?

At least AWS does have such a support desk, Azure may have one, and with GCP you are better off just shutting down the company.

How about smaller providers such as DigitalOcean?

Let's say your code mistakenly provisions 1000 droplets instead of 100. Is this a scenario you can prevent at an admin level?


Amazon doesn’t really want to kill startups using AWS even when they fuck up. If you had made some phone calls / worked LinkedIn / Tweeted / written a blog post you could have gotten that refunded in a week.

But you need to know this is possible in the first place. And once you've had to pay the price for it it's hard to ever feel safe using it again.

A lot of times your AWS rep can help with stuff like that. They are more interested in ongoing revenue than a one-time score that will end up going to collections. I had a situation where someone spun up an instance with an AMI that cost $$ per hour to run, then decided to go home early and disconnect for the weekend, so Monday morning I noticed we had been billed $22,000 for an idle instance running the AMI. It got handled. No worse than the kid who bought ten grand of “smurfberries” on his dad’s phone.

That's the one aspect I liked: never worrying about what the bill will look like. And with PHP it feels like it provides the best balance of value, resources, and cost.

This is why I advise clients to set up billing alerts delivered to management (not the tech team) as one of the first things to do when adopting a pay-as-you-go technology [1].

[1] https://www.futurice.com/blog/mastering%20bigquery%20costs
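For AWS specifically, here is a rough sketch of what such an alert can look like with boto3 (not the exact setup from the linked post; the alarm name, threshold, and SNS topic ARN below are placeholders, and billing metrics have to be enabled on the account first):

    # Sketch: alarm on the account's estimated charges and page whoever is
    # subscribed to the SNS topic. Billing metrics only exist in us-east-1.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="monthly-spend-over-1000-usd",      # hypothetical name
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,                                 # evaluate every 6 hours
        EvaluationPeriods=1,
        Threshold=1000.0,                             # fire past ~$1,000 for the month
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder ARN
    )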


For startups that "might take off" I would use a dedicated virtual private server as a good alternative to unreasonable bills due to provisioning mistakes. You could be spending that time coding rather than figuring out if you're making a provisioning mistake. There are too many stories like yours out there.

Here is my referral code for the one I use:

https://crm.vpscheap.net/aff.php?aff=15

(I previously asked Dan, the mod here, if I can share in this way and he said it's okay. I don't have other affiliation with that company and have found it good.)


I don't understand why Amazon does not provide hard billing limits, like "reach X USD, then shut down everything", just for emergencies.

because it's not in their business interests to do so?

Scary. Does AWS allow one to set a hard price limit on your account today?

That's the reason I never went with cloud hosting. No kidding.

If shared PHP hosting was good enough for your use case why on earth were you on AWS (and running 4K video encoding clusters!) in the first place?

You should call AWS. They will negotiate. If it doesn't work out, I think you should file a dispute with the credit card company. They will probably back you and want AWS to negotiate. If none of this happens, threaten to declare bankruptcy. Either way, AWS will back off. There is no money in it for them. In no way should you pay the 80k bill and shut down your company.

Yes, you're legally obligated to pay the bill. You can't decide you don't want to pay because you don't like the cost.

Claiming an unfounded dispute or transferring funds to a new company is fraud and you'll probably end up with both AWS and your bank coming after you for collections. With $80k on the line, its enough to file legal claims.

The best plan is to negotiate directly with AWS and ask for forgiveness and a payment plan. Do not try to run away from your financial obligations or you will make it far worse.

EDIT: you've rewritten your comment so this doesn't apply, but please don't recommend that people avoid their debts.


"Yes, you're legally obligated to pay the bill"

Yes, in civil court. So no, don't pay it. This isn't London in 1689. Debtors' prisons do not exist.

"Claiming an unfounded dispute or transferring funds to a new company is fraud"

Explaining to the credit card company you were tricked or confused in this purchase is not fraud.


1) Civil vs criminal does not change the fact that incurred bills are a legal obligation.

2) AWS did not take advantage of you and making a mistake does not absolve you of responsibility. There's nothing to dispute.

3) Bankruptcy is allowed, and is also exactly what happened. You stated other things like filing fake disputes and transferring funds to a new company, which is fraud. And that does come with criminal charges.

EDIT: to your completely revised comment - Bills are still owed, even if it's only civil, and judgements can result in wages, taxes and other assets being garnished. Saying you were "tricked or confused" when you weren't is fraudulent, and credit card companies are not going to defend you from that. Unless AWS forced those charges or failed to deliver services, there's no dispute.


"AWS did not take advantage of you"

How do you know? That is what a court system is for.

"like filing fake disputes and transferring funds to a new company"

Ah the old straw man. Nope. I didn't say file fake disputes.

"please don't recommend running away from debts"

Having the ability to not pay debts is the entire point of Limited Liability Companies. People, out of human dignity, should have the right to NOT pay debts. Please don't recommend paying whatever a creditor wants.


The poster clearly admitted what happened. Mistakenly running up a large bill doesn't clear your responsibility to pay that bill and knowingly filing disputes or changing companies is fraud. You can definitely go to court but without clear evidence you will likely lose and then owe even more.

Since you're revising your comments, there's no point to further discussion but please don't recommend running away from debts. That's not going to end well.


I'm not a lawyer, and this is not legal advice: You'd be surprised. There's often enough give and take, enough ambiguity, in contract law and/or in any given contract such that disputing a debt is not a further wrong (criminal or civil). But you might be on the hook for interest and damages resulting from the delay if you lose.

You don't have to be further wrong, but going to court isn't free so you've only increased your total costs at the end if you lose.

Yes, of course. But you were suggesting it was necessarily fraud. I'm just pointing out that that's not necessarily the case.

Knowingly filing a false dispute or creating a new company to transfer funds to get out of paying a bill is absolutely fraud. Where is that not necessarily true?

A person might aspire to be responsible for a bill like that, but that doesn't make it ethical or good business for Amazon to refuse to waive or reduce it.

Familiar with The Merchant of Venice?


That's a separate topic. And it doesn't mean you should ignore your debts as the other poster was saying, because that's also unethical and possibly fraudulent.

Nobody should ignore their debts. I think maybe you should just ignore whoever you think was suggesting that, because that's obviously impractical/self destructive and not worth debating.

That was the original comment that started this whole thread. Maybe you should reply to them instead of telling me what not to do hours after the conversation is over.


Isn't it a well accepted fact that cloud pricing is opaque? Does that not leave the discussion open to the argument that, given how hard it is to understand the pricing of multiple interconnected services, it is very difficult for a user to make informed decisions, such that perhaps not all of the liability is their own?

It's not opaque. It's actually very transparent and well-documented. The issue would be complexity, but that's going to be a very difficult claim considering that you weren't forced into using any of it.

If you cannot pay, negotiate with Amazon instead of stiffing them after agreeing to the TOS.

I'm not a lawyer and this is not legal advice: disputing debts is not by itself fraud, even if you end up losing. If you have some "colorable claim" (i.e. some basis in law and fact to think that a court might plausibly rule in your favor), then you are in your rights to test it in court. But don't be surprised if upon losing, you are forced to pay interest and/or other damages accrued due to the delay.

Credit cards allow for disputes when there are problems with the transaction (fraudulent seller, not honoring terms, not providing services, etc). It does not cover you mistakenly buying what you don't need.

Filing a dispute when you knowingly made a mistake is a bad move, and your bank will quickly figure this out when AWS provides the billing statement, API logs and signed TOS. You're going to have a very tough time if you try to litigate this in court.

Debts (or at least payment plans) can be negotiated. Disputing to weasel out of them will only make things worse. A little communication can go a long way.


Accidental overspend is probably a big part of cloud revenue. When you have an AWS account being used by 6 dev teams with their own microservices, how does anyone know whether you're paying for resources that you don't need? Very few people even understand how to create a cost-optimized setup for their own project.

Transferring money out from a company to hide from creditors is probably illegal.

We read horror stories in the media because they make good stories, but in fact, people fuck up all the time in business and then it's just reversed because in most cases, nobody is irrationally out for their pound of flesh. People fat finger multimillion dollar trades on Wall Street and while I don't know that it's guaranteed to work out, I definitely have read about instances of that being reversed.

If cryptocurrency and smart contracts make sense to you, you might not be aware that forgiveness for human error really does happen in normal business.


How does AWS collect on the $80k debt? If you're a startup, you could cancel/freeze your credit card, dissolve the LLC / S Corp, set a new one up with a similar name, and transfer all IP assets over. Poof - all debts and liabilities erased.

What's wrong with this approach? It's not like they can collect on you personally, or go after the new company. (I wonder how they would even figure out what legal entity is behind the new company/website.)


You'd run the risk of a court deciding to pierce your LLC's shield.

https://www.nolo.com/legal-encyclopedia/personal-liability-p...


I read that, and the key thing to pay attention to is whether you willfully did something that was “unfair”, “unjust”, or “fraudulent”.

I don’t think those apply here. If by sheer accident you were hit with a giant AWS bill, and you were facing potentially having to shut down your company, and you conducted the maneuver that I described, what’s wrong with it? Your company was facing a life-or-death situation, and decided to be reborn.

Maybe there needs to be a form of corporate bankruptcy where the company can retain its core/key IP assets...


It's not like they can collect on you personally, or go after the new company.

That is not a safe assumption to make, especially if you are deliberately (AKA fraudulently) dodging debts (IANAL).


I’ve addressed this in this reply to a sibling comment stating that the LLC/corporate shield could be pierced here: https://news.ycombinator.com/item?id=22734033

Are you sure this is without any potential for legal trouble or credit-rating damage that could prove as costly as just resenting but paying the bill and moving on? I do not have this experience or first-hand hearsay of this situation. Did you have this experience yourself and have an anecdote to share with us? Or is this common knowledge that I should know?

Azure has, built in, hard price/cost limits but doesn't allow the public to use them. For example if you have MSDN subscription credit you get a hard limit of up to $150/month, but you yourself cannot pick a bespoke limit to use the service more safely.

Kind of annoys me. I'm sure enterprises don't care and want it unlimited. But solo practitioners and people new to the platform would love a default e.g. $5K/month limit (or less).

Feels like these services just want people to "gotcha" into spending a bunch of money without simple safety nets.

PS - No, alerts do not accomplish the same thing, by the time you get the alert you could have spent tens of thousands.


This is only anecdotal, not personal experience, but I've read online and have had friends "oops" away large sums of money on AWS, and for the most part they seem to have at least gotten a partial discount when they contacted customer support.

I strongly suspect opaque pricing and high/nonexistent limits are more about getting large organizations to transition to the cloud seamlessly (i.e. not completely caring/realizing what they're getting into for any particular migration/deployment).

Tricking personal users into spending thousands by accident probably doesn't net much money compared to enterprise spend and runs the risk of alienating people who then can go into work and recommend against using a particular platform, having been burned by it on their personal accounts.


> “oops" away large sums of money on AWS, and for the most part they seem to have at least gotten a partial discount when they contacted customer support

As a counter-datapoint, we accidentally left a Redshift cluster up idling for two weeks before we started getting alerts, and after numerous attempts have failed to get compensated in any way. The reasoning was that, well, it was what we requested and they had to allocate compute power to it (which we didn’t use).

All in all a very frustrating experience and it makes me fairly cynical of all these “I got my money back without problems!” comments.

(For what it’s worth, it was about $4k of costs which was a lot for us at the time)


It's also unnecessary. Give users cost controls, then you won't need the current mess of hoping support will write off $$$ of mistakes. With the risk of bankrupting a small shop if support doesn't help, it drives risk-averse users towards less dynamic offerings.

Isn't AWS supposed to focus on the virtuous cycle of saving customers money (or at least reducing AWS supports need to write off customer mistakes)?


I always have the feeling of a bit of “randomness” with these kinds of compensations. It makes sense, as it’s difficult to “codify” these types of things, lest they get abused and you might as well just lower your prices at that point.

AWS is a large organization; I believe this type of stuff highly depends upon your “entrance” into the organization, i.e. the account manager. We were probably just unlucky with our Redshift troubles, but it did eventually trigger a move to Google Cloud / Bigquery, as the pay-as-you-go method seemed a bit safer (although it’s still too difficult imho to accurately estimate the costs of queries).


similar story here, but with $80k, which almost killed the company. woot.

A current coworker used to work at AWS and shared an anecdote. Makes sense but I’m surprised I haven’t seen it repeated:

nobody wants to work on the billing code because it’s a mess and the penalty for a mistake is very high.


I, for one, had one of those "oops" when I misused some cloud services, making a VPS instance accessible from the network.

When I got charged an extremely large amount I was contacted by the customer support. They explained to me what happened (I had leaked a SECRET in my repo) and then got refunded the total amount.

It was quite a surprising experience because I wasn't expecting any of it, as it was my mistake.


I inadvertently committed SES creds to GitHub. The only way I found out was an email from AWS 24 hours later telling me that the creds had been suspended for a high reported spam rate.

Someone had sent 70,000 emails (which is the default daily limit at my tier). Luckily only cost ~$8.


How do people keep doing this? It’s stated over and over again in all the AWS documentation never to put access keys in code or your configuration. Locally, your keys should be stored as part of your user profile (configured with the AWS CLI), and when your code is running in AWS, it should get permissions based on an IAM role attached to your EC2/ECS/Lambda.

My guess would be poorly configured .gitignore files. I recently had a project collaborator commit his .env file containing credentials to a MongoDB cluster because of this.

If you follow the guidelines that are repeatedly stressed, your code should never be reading or handling AWS credentials directly. That wouldn’t be an issue.

The access keys would be in your user directory and all of the SDKs would know how to find them. When running on AWS, the SDKs get the credentials from the instance metadata.
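A minimal sketch of what that looks like in practice with boto3 (the other SDKs resolve credentials the same way); note that no keys appear anywhere in the code or the repo:

    # The default credential chain finds keys from environment variables,
    # ~/.aws/credentials (written by `aws configure`), or the IAM role
    # attached to the EC2/ECS/Lambda the code runs on.
    import boto3

    s3 = boto3.client("s3")   # no access keys passed in
    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])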


This happened to me when I was a student. I didn't understand the EC2 billing cycle well enough and ended up using much more than my student credits. The final bill was high for a student (but still less than $1000). I contacted AWS support; they waived the charges but cancelled my account and told me I couldn't use that email address for AWS anymore.

Or just not being organised enough to calculate the processing costs realistically; that's got to be pretty difficult to do.

It's their loss. I have moved off AWS for this very reason. My pet project cannot involve the risk of possibly costing me thousands or more because I made a mistake or it got super popular overnight. I'd rather my site just 503.

I know it depends on the site/app, but for a hobbyist, what is the biggest gotcha in a "got super popular overnight" situation? If I look at the quotas and overages for low-end plans from the following providers, e.g., it's not obvious to me where the realistic bottlenecks are:

* Firebase Hosting with Firestore

* Cloudflare Workers Sites (using KV)

* Netlify (possibly w/ FaunaDB)


I'd rather my pet project return 503s after going popular overnight than foot a huge bill. Especially since my pet projects generally don't generate any revenue anyway. This is the most important feature for me and why I went with GCP.

You know what's going to be cheaper and simpler? Get some Vultr VPSes. Maybe one for your web server, one for a Postgres DB, and another for a Redis if you need it.

Done. For 98% of hobbyist projects, a single $5/month Vultr node is probably far more than enough. For 99%, three $10/month Vultr instances (web, DB, cache) are probably enough.


Billing alerts + Lambda. It's not two clicks, but there are plenty of CF templates.

The best solution to avoid huge accidental AWS bills due to mismanaging AWS services yourself is to manage your own AWS billing alert service?

Yes, because without them knowing where your infrastructure can be killed/what can be deleted in order to reduce costs without completely destroying your business, there's no way for AWS to do this for you.

> without them knowing where your infrastructure can be killed/what

So add an interface that will let you specify that somehow for common scenarios? There must be something better than zero help they can offer. Not everyone needs something that can autoscale to Google levels.


I think the idea is that they do help you. They provide alerts and and APIs that can be used to programmatically control all of your infrastructure. So in a sense having a Lambda listen for billing events and respond in a way appropriate to your particular organization may be pretty close to the best solution.
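A rough sketch of such a Lambda, assuming it is subscribed to the SNS topic the billing alarm publishes to. What it actually shuts down is a business decision; this one only stops running EC2 instances in a single region and leaves everything else alone:

    # Billing "circuit breaker" sketch: stop every running EC2 instance.
    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        running = ec2.describe_instances(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        instance_ids = [
            i["InstanceId"]
            for r in running["Reservations"]
            for i in r["Instances"]
        ]
        if instance_ids:
            ec2.stop_instances(InstanceIds=instance_ids)
        return {"stopped": instance_ids}

Anything stateful (databases, storage) needs its own policy, which is exactly why AWS can't sensibly do this for you generically.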

If your credit card bounces, I'm sure they'd have no problem killing your infrastructure

I'm sure at some point, but at that point you're no longer really their customer and I'm sure they're less worried about not completely destroying your work or livelihood.

I've actually had my payments on my personal account bounce once or twice and no, they did not.


Alerts are just alerts. I don’t and can’t monitor alerts 24/7.

The alert automatically calls a Lambda function which turns off all your services.

Which puts a burden on the user because this needs to be tested. Also, it's not 100% safe because the user is still accountable.

Since when was a user being accountable a bad thing?

Or they could just offer a limit ;)

How do you auto-test that this actually works and continues working properly?

Set a low limit then scale up.

That's why you have the Lambda there to scale it down. You don't need to sit there.

Oh okay, I misunderstood.

there are plenty of CF templates

This by definition is deploying something you don't fully understand. If there's a problem in any of those templates you won't know. You won't really know if they even do what they say they do.

Using one to do something as important as this would be crazy.


I don't think he's saying that you should blindly deploy this stuff. But you don't have to create a solution from scratch. There are existing templates out there that you can leverage to build your own solution

Billing alerts are not real time; far from it, actually.

So set a lower limit

Then you need to be on call 24/7 in case someone happens to DDoS you.

That's why you have the Lambda there to scale it down. You don't need to sit there.

Auto-self-DoS is what they were getting at!

At the risk of bringing up the dreaded name here, these kinds of billing shocks were one of the problems Oracle wanted to solve with Oracle Cloud Infrastructure (OCI). So it has been built from the ground up with variable limits and quotas in mind. Every service has them from the outset so that customers can control their maximum expenditure. When they started out building OCI, the major clouds weren't offering this as a key feature.

Enterprise companies do not want infinite billing. They want fixed and reliable billing, more than anything else. With on-prem equipment they know a few years in advance what their expenditure is going to be at any time, and will have a budgeted amount over the top of that that they're on-board with. Bring the idea of autoscaling with limits, and they're very happy indeed, particularly with the idea of automatically scaling down.

> Azure has, built in, hard price/cost limits but doesn't allow the public to use them. For example if you have MSDN subscription credit you get a hard limit of up to $150/month, but you yourself cannot pick a bespoke limit to use the service more safely.

I would be willing to bet that that is something enterprise customers can get access to, particularly if their annual expenditure is high enough under normal operation. Microsoft knows the enterprise market very well, just like Oracle does, and like Amazon doesn't (historically speaking, at least).


Yeah, sign me up for being annoyed at this. Several times I'm just like: "why the hell can't I just pay for this thing up front and know what I'm spending?" Then there's also both Google Cloud, AWS, etc. not letting you spin up certain machines because of "quota limits" which you have to apply to raise. It's like: WTF? Do they want your money or not? Idk why it's designed like this, but it's a horrible experience.

It's designed to limit the damage when people put their AWS admin credentials on GitHub or in their Android app and someone uses it to mine Bitcoin :)

There is significant regulation around prepayment in many jurisdictions, e.g. gift cards, and it is likely that the cloud businesses do not want to enter that minefield.

Yes, some of them have gift card programs already, but they probably don't want to contend with the expanded regulations that come with large sums of money.


> Feels like these services just want people to "gotcha" into spending a bunch of money without simple safety nets.

Because it is just like that. Nine years ago Amazon said (about the same issue raised 14 years ago): "We’ve received similar requests from many of our customers and we definitely understand how important this is to your business; We hear you loud and clear. As it stands right now, we are planning to implement some features that are requested on this thread, though we don’t yet have a timeline to share." [0] In other words, they know people need it, but they prefer not to implement it.

[0] https://forums.aws.amazon.com/thread.jspa?threadID=58127


There are other ways to set up guard-rails in Azure - policies are one such feature:

https://docs.microsoft.com/en-us/azure/governance/policy/tut...

They may not necessarily enforce spending limits - but it's possible to restrict provisioning of costly resources, or even whitelist resources that can be provisioned. Almost every Cloud Foundation project nowadays involves setting up these guard-rails.
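As a sketch of what such a guard-rail looks like, this is the shape of a policy rule that whitelists a couple of resource types and denies everything else (written as a Python dict for readability; the list is purely illustrative, and in practice you'd assign it at a subscription or resource-group scope):

    # Azure Policy rule: deny any resource whose type is not whitelisted.
    policy_rule = {
        "if": {
            "not": {
                "field": "type",
                "in": [
                    "Microsoft.Web/sites",                # app services
                    "Microsoft.Storage/storageAccounts",  # storage accounts
                ],
            }
        },
        "then": {"effect": "deny"},
    }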


Having reliable hard limits for production accounts can be technically difficult, as you need to do billing in real time and also make decisions on what services to kill once the limit is reached. Do you just stop VMs? Do you automatically delete data from storage? Many of these choices could result in loss of production data.

There can be also many reasons for the budget overrun. It's not always a user error. It could be issue with the platform itself such as error in billing system or faulty autoscale logic. Or it could be caused by an external event, such as denial-of-service attack.

(Not sure how things work with the MSDN subscription credit, but at least you are not supposed to be running production workloads with those)


Just because the decisions are difficult to make doesn't mean there's no need for this feature or ways to implement it.

> people new to the platform would love a default e.g. $5K/month limit (or less).

I ran into that issue when I wanted to play with AWS EC2 (a few years ago; maybe it has changed since then, or maybe I didn't look hard enough). The free VMs were too slow to be usable. Considering my usage, I was unlikely to run into unexpected spending, but I didn't want to take any risk. Can anyone recommend a similar service with a simpler customer interface where you can set up a simple safety spending limit?


If you're just looking at VMs you'd probably be better off with something like linode or digital ocean, and get flat monthly fees.

Though amongst those service types, I can't really recommend beyond the fact that Linode & DO didn't give me any headaches for the one month I used them.


> If you're just looking at VMs you'd probably be better off with something like linode or digital ocean, and get flat monthly fees.

Which means you won't learn AWS/Azure/etc., and they lose mind-share. This is actually an argument for why they SHOULD offer hard limits, not an argument against.

If their goal is to push startups/newbies/hobbyists to other platform, they're definitely on the right path. If the goal is to make their cloud services safe to learn/start using, then they could do much better.


Yeah, for my personal projects I always stick to flat fee hosts like Linode and only ever used AWS for some backup storage in S3, and GCP for a geographical region that Linode doesn’t serve well. And whenever I use the big clouds I get paranoid and have to check billing & usage very often since I’m always just one oops/DDoS away from incurring a large bill, as opposed to the flat fee hosts where I leave shit running for months or years at a time without worrying. (FWIW Amazon Lightsail might be a flat fee service, but I heard performance is pretty bad so never tried it.)

I think even Lightsail can expose you to data transfer overages although I don't know how large a bill those could realistically add up to.

Yes, Lightsail’s egress overage fee is the same as EC2’s crazy egress pricing (at least $0.09/GB), whereas Linode charges me a much more reasonable $0.01/GB if I go over.

Newbies/hobbyists shouldn't be using AWS/Azure over DigitalOcean/Vultr/Linode unless their hobby is learning AWS/Azure. Most startups shouldn't either... if you can't afford to hire an AWS/Azure expert you shouldn't be using it. You are probably doing it in a way that will cost you in the future.

> if you can't afford to hire an AWS/Azure expert you shouldn't be using it.

Your logic is a self-contradiction:

- You need an expert to use AWS/Azure

- It is unsafe to even learn AWS/Azure without already being an expert.

Where do these experts come from? Osmosis? If there's no safe way to learn them, and being an expert is a prerequisite to using them, then you've created an artificial self-limiting supply shortage.

This is another argument that defeats itself and shows that these limits are absolutely needed to stop a mindshare loss/lack of expertise.


Where do these experts come from?

In my case, working for a company that gave me admin access from day one with no practical experience with AWS.

Even though I haven’t done anything stupid (yet) and think I know enough not to now, I wouldn’t recommend that....


I still worry about someone getting into my account. The largest instance would run $2,240/month, and you can spin up 25 of them no questions asked. Plus there's Spaces, backups, snapshots.

My own mistakes are probably a greater risk, but still. Turn on that 2FA.


I have a prepaid account at Aruba Cloud for my VPS, zero risk. Just top up when necessary.

One of the original points (though they've expanded in capabilities since then) of cloud services is that they're pay-per-use and can scale up and down as needed. Of course, that cuts both ways. If you mostly just care about compute, you probably just want some traditional hosting service with bandwidth caps (rather than transfer overage charges).

AWS LightSail.

Once you set up a subscription service and are selling it to a huge number of people/companies, it's rare that you or one of your salesmen don't want the clients to spend a large sum of money on the service unintentionally and then get a partial discount as goodwill. I have seen it in sales of all kinds of services. It's just that they have different tricks.

How does that work for stateful services like S3? Should they just delete the data? (Which for some people may in fact be what they'd want.)

I do realize you can get closer to a hard limit while possibly exempting some services that would let you get over the limit--I suppose. Though then people would doubtless complain that the hard and fast limit is not, in fact, a hard and fast limit.


> How does that work for stateful services like S3? Should they just delete the data?

Intuitively, if you're capping out your S3 storage, the hard cutoff should look like "don't allow me to store any additional data".

If you're capping out retrieval, then "don't serve the data any more".


Yes for retrieval. Just don't serve it.

But if the data is stored, the clock keeps ticking until you delete it. If I have a TB of data stored, and I hit my $1K (or whatever) limit on April 15, the only way I don't get hit with a >$1K bill for the month is if AWS deletes everything I have stored on the service. (Or at least holds it hostage until I pay up for the overage.)


You can easily calculate what the bill will be at the end of the month if no new data is stored or deleted between now and then. So if you need a hard cutoff for storage, use that.

There's enough room there for workflows where I know I'm going to delete data later that allowing configuration would be valuable. (Maybe I can set a timed expiration at the moment of storage, instead of having to store first and separately delete later? That would keep end-of-month predictions accurate.) But it isn't difficult to set the hard cutoff.
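The back-of-the-envelope projection itself is trivial; here's a sketch with an assumed rate (roughly S3 Standard in us-east-1 - check your own region and storage class):

    # If nothing is added or deleted, next month's storage bill is just
    # GB stored times the per-GB-month rate.
    PRICE_PER_GB_MONTH = 0.023   # assumed rate

    def projected_monthly_cost(gb_stored: float) -> float:
        return gb_stored * PRICE_PER_GB_MONTH

    # A 1 TB bucket left alone costs roughly:
    print(projected_monthly_cost(1024))   # ~$23.55/month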


So then your AWS/Azure service is turned off April 2 because you had some temporary spike in uploads?

What you're asking for is not possible and will have unintended consequences. Guaranteed not to meet every customer's expectation of how it works.


> So then your AWS/Azure service is turned off April 2 because you had some temporary spike in uploads?

Yes, that's the idea. Compare https://news.ycombinator.com/item?id=22719015

>> My pet project cannot involve the risk of possibly costing me thousands or more because I made a mistake or it got super popular over night. I'd rather my site just 503.

> What you're asking for is not possible

How so?


> How does that work for stateful services like S3? Should they just delete the data?

No, for services like these it should cap at the cost of keeping the data indefinitely. If your budget limit for S3 was $1000 per month, and you tried to add an object which if not deleted would make you use $1010 next month (and every month after that), it should reject adding that object.


We've got processes that push massive files into S3 for a later to stage to then stream out, and delete when they've completed successfully.

So now we've created a situation where everything's running fine, our bill is consistently $500/mo, I go casually turn on a $1k/mo spending limit... aaaaand suddenly everything starts failing in totally non-obvious ways.


Hmm... I'm a big fan of fixed-budget or prepaid services, but between network, storage, VM costs, etc., what should the provider stop serving if you exceed the spend? Create an outage for your whole service? Throttle egress? Start randomly killing VMs?

I think for their hard limit on MSDN funds, MS shuts down everything that creates incremental costs. As far as I know they don't delete anything, even things like storage that have an associated cost.

That said, I am pretty sure the TOS for MSDN funds say they are not to be used for production systems.


There are also individual limits on the number of cores per account, etc. - something we ran into when we needed to quickly scale on Black Friday, and support took forever to get back to us even with a priority A ticket.
