Metal Storm's downtime



Posts: 25   Visited by: 134 users
23.03.2024 - 01:27
corrupt
With a lowercase c
Admin
Originally I thought I would abuse the news for this, but it seems better placed here.

So what happened?

At 4 a.m. server time (which is CET right now) on Thursday, the server started stalling. CPU usage was constantly at 100%, which made the site slower and slower, to the point where it wouldn't respond at all anymore.

Usually when this happens, it's because of database issues. As most of you will know, this site has existed for over 20 years and we're carrying a lot of legacy sadness, both in code and database design. Many decisions that were made in the early 2000s were made with the limited hardware of the time in mind and don't translate super well into modern technology. Short of a complete rewrite we'll likely never shake that fully, but we're a hobby project run by volunteers in their free time, who also pay for everything. So having few resources is fine. It's part of the fun in a way, as we're forced to make a lot of compromises and find creative solutions to problems that the industry has long since solved, but with solutions that are usually not affordable to us.

This time, however, it was not a database issue. It was the database that stalled, but the cause was somewhere outside of our control. I don't want to bore anyone with technical details - and, frankly, we don't know ourselves exactly what happened. But to take some pressure off the site and to be able to investigate, I put the maintenance page up instead.

Ivor and I spent the better part of the last two days trying to figure out what the exact cause is. We have a few theories and are in talks with our provider to find out more. Could be a misconfigured crawler, could be a targeted attack, could be a provider issue. We'll find out when we find out.

We did make a few changes to the site while we were at it. The most visible will likely be a rate limiting solution. If you refresh the site too often in a certain window of time, you will be redirected to a landing page with an HTTP status code 429 "Too Many Requests". It's a sliding window, so after a while you'll be able to reload and should be right where you wanted to go. We're running this as an experiment. I have the same system with stricter settings in place for web crawlers, which usually respect these and have increasing back-off windows to compensate.
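
For illustration only, a sliding-window limiter of the kind described above could look roughly like the Python sketch below. It is not the site's actual implementation: the class name, the in-memory storage, and any limits passed in are assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window` seconds."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def check(self, client):
        """Return (status, retry_after_seconds) for one incoming request."""
        now = self.clock()
        recent = self.hits[client]
        # Slide the window: forget requests older than `window` seconds.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            # Over the limit: a slot frees up once the oldest request leaves the window.
            return 429, self.window - (now - recent[0])
        recent.append(now)
        return 200, 0.0
```

A stricter instance for crawlers would simply use a smaller `limit` or a longer `window`. Because only accepted requests are recorded, a client that keeps hammering the limiter is not penalized further; it just has to wait until the oldest accepted request ages out.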

But rate limiting is also in place for everyone here. I picked a window that I feel should allow everyone to use the site normally, but I have learned in the past years since taking over operations here, that you only get feedback when you annoy people. So if you see a lot of rate limiting in what you feel is normal use of the site, let us know. We can always tune things.
I can say, however, that both Amazon's and Bing's bots are already backing off. They're allowed to load a page here once every six seconds now.
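
How those bots actually back off is not known here. As a rough sketch of the general idea, a polite crawler might compute its next wait like this. The function name, the doubling strategy, the cap, and the use of a `Retry-After` hint are all assumptions; only the six-second base interval comes from the post above.

```python
def next_delay(status, current_delay, retry_after=None, base=6.0, cap=3600.0):
    """Return how many seconds a polite crawler waits before its next request."""
    if status == 429:
        # Back off: double the previous delay, but never wait less than the
        # server's Retry-After hint (if one was sent) and never more than `cap`.
        doubled = max(current_delay, base) * 2
        hinted = retry_after if retry_after is not None else 0.0
        return min(max(doubled, hinted), cap)
    # Any other response: go back to the base crawl interval.
    return base
```

Repeated 429 responses grow the wait from 6 to 12 to 24 seconds and so on, while a single successful response resets it to the base interval.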

Obviously we're hoping that these measures and some more background work will prevent issues in the future, but there will never be any guarantees. The server is still behaving weird and we're still in talks with our provider to figure out why. I also have a few more updates in the pipeline now, that'll require me to take down the site for a short while, that I hope I can get done before April. When that happens, the maintenance page will tell you so. In the meantime, enjoy the site in a hopefully functioning state.
----
23.03.2024 - 11:10
musclassia
Staff
Huge thanks to you, corrupt and Ivor, for battling against whatever happened on Thursday and Friday, and for all the work you do keeping this broken site functional. It'll be very interesting to learn what happened if the provider is able to identify the cause, if it's something that can be disclosed; hopefully it wasn't an alliance of Najand and Hercules taking revenge on the site for my reviews of their albums last year
23.03.2024 - 11:39
Bad English
Tage Westerlund
Danke schön and kiitos to you guys.
I hope it won't happen again. Good to see our home and fav site is back in the saddle.
----
I stand whit Ukraine and Israel. They have right to defend own citizens.

Stormtroopers of Death - "Speak English or Die"

I better die, because I never will learn speek english, so I choose dieing
23.03.2024 - 11:41
Bad English
Tage Westerlund
Written by musclassia on 23.03.2024 at 11:10

of Najand and Hercules taking revenge on the site for my reviews of their albums last year

Or Black God waited 13 years for his revenge. Right Barry, you remember
----
I stand whit Ukraine and Israel. They have right to defend own citizens.

Stormtroopers of Death - "Speak English or Die"

I better die, because I never will learn speek english, so I choose dieing
26.03.2024 - 14:55
corrupt
With a lowercase c
Admin
Update on all this:

After several days of trying internal optimization and profiling, we finally got an answer back from our provider about some changes they made in the background that correlate well with both the timing and nature of our issues.
We got things worked out by moving to a different type of server instance that is more expensive, too. So don't be surprised if we actually pursue community donations in the near future. We tried running the site on limited resources to limit our cost but it is becoming more and more clear that we cannot continue to do this indefinitely.

For the time being, we should be good. Back to our old performance, if not better for the new hardware alone. But we also did some updates in the meantime. A new database version that should bring a little speed boost for some of the more complex queries and we also spent some time optimizing a few of the more taxing queries, which should improve page loads overall.

We also ditched "Unable to connect to the database" for good, replacing it with a screen saying the site is under heavy load. That should help with our page rank in high-stress situations as Google and other crawlers don't look kindly on pages just not working.
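
As a hedged illustration of that change, a page handler could catch the failed database connection and answer with a temporary "heavy load" page instead of a raw error. The post does not say which status code or headers the new screen uses, so the 503 status, the `Retry-After` value, and the function names below are assumptions.

```python
def render_page(load_from_database):
    """Return (status, headers, body) for one page request."""
    try:
        body = load_from_database()
    except ConnectionError:
        # Database unreachable: show a temporary "heavy load" page and tell
        # clients (including crawlers) when it is worth trying again.
        headers = {"Content-Type": "text/html", "Retry-After": "120"}
        return 503, headers, "<h1>The site is under heavy load</h1><p>Please try again in a few minutes.</p>"
    return 200, {"Content-Type": "text/html"}, body
```

The idea behind a 503 with `Retry-After` is that it marks the outage as temporary, which crawlers generally treat more kindly than a page that is simply broken.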
We also still have rate limiting in place. As said before, we are open to feedback on that front, but no promises. Right now our main priority is running the site reliably without outage or major slowdowns.

We will go down for some more maintenance in the next couple of days. But it will be for short periods, and that is considered planned downtime.

If you have any questions or feedback, feel free to leave them here.
----
26.03.2024 - 15:42
The Galactician
Huge thanks to you guys for dealing with the issues and getting the site back up. I understand site design and maintenance to the extent that you have my deepest sympathies for the inherited madness. That's no easy task or charge, to say the least.

One super minor thing. I never saw a site maintenance message during the downtime. It was just a long load followed by a completely blank screen. Again, for the amount of time I or anyone else would ever see such a thing, probably no big deal, but maybe worth knowing about for the future.

I'm in when the money request happens. Lord knows I've gotten enough out of this site for it to have been earned many times over.
26.03.2024 - 15:54
corrupt
With a lowercase c
Admin
Written by The Galactician on 26.03.2024 at 15:42
One super minor thing. I never saw a site maintenance message during the downtime. It was just a long load followed by a completely blank screen. Again, for the amount of time I or anyone else would ever see such a thing, probably no big deal, but maybe worth knowing about for the future.

The maintenance page only shows when I explicitly route to it. Over (German) night I had the site up just to see if things would sort themselves out like they had last night. At that point you wouldn't have seen anything. I put the maintenance page back up when I started working on the site this morning to take the load off. That was around four hours ago.
----
26.03.2024 - 15:59
Liafev
Written by corrupt on 26.03.2024 at 14:55

So don't be surprised if we actually pursue community donations in the near future.

Funny you mention this a few days before April 1st considering the great April fools you gave us in the past
26.03.2024 - 16:05
corrupt
With a lowercase c
Admin
Written by Liafev on 26.03.2024 at 15:59
Funny you mention this a few days before April 1st considering the great April fools you gave us in the past

Oh, fair warning on that. There won't be anything this year. Not a joke. We just couldn't come up with anything that didn't seem like a cheap knock-off of stuff we did in the past.
----
26.03.2024 - 16:10
Liafev
Written by corrupt on 26.03.2024 at 16:05

Oh, fair warning on that. There won't be anything this year. Not a joke. We just couldn't come up with anything that didn't seem like a cheap knock-off of stuff we did in the past.

Fair enough. I mean you know my position on that, I'll gladly donate the moment I'm given the possibility to
26.03.2024 - 16:15
The Galactician
Written by corrupt on 26.03.2024 at 15:54

Written by The Galactician on 26.03.2024 at 15:42
One super minor thing. I never saw a site maintenance message during the downtime. It was just a long load followed by a completely blank screen. Again, for the amount of time I or anyone else would ever see such a thing, probably no big deal, but maybe worth knowing about for the future.

The maintenance page only shows when I explicitly route to it. Over (German) night I had the site up just to see if things would sort themselves out like they had last night. At that point you wouldn't have seen anything. I put the maintenance page back up when I started working on the site this morning to take the load off. That was around four hours ago.

Ah! makes perfect sense. I thought you meant it was up during the downtime. My mistake and disregard, of course.
26.03.2024 - 16:55
tludmetal
Why not put your server on AWS or any other cloud service? You could leverage a CDN in front of the server, high availability on demand, etc., and also a WAF or other protective measures. This site is much appreciated by the metal community, which could make contributions. For my part, as a solutions architect, I could provide help if you need it.
26.03.2024 - 17:00
corrupt
With a lowercase c
Admin
Written by tludmetal on 26.03.2024 at 16:55
Why not put your server on AWS or any other cloud service? You could leverage a CDN in front of the server, high availability on demand, etc., and also a WAF or other protective measures. This site is much appreciated by the metal community, which could make contributions. For my part, as a solutions architect, I could provide help if you need it.

Do you have any idea what that costs? AWS in particular will explode super fast. Especially when you're not ready to go microservice for horizontal scaling, which is just not something we can do. Not that I'm keen on managing an AWS account on top of the site. AWS is a shit ton of work. And I say that from professional experience.
----
26.03.2024 - 17:15
Ivor
Staff
Written by tludmetal on 26.03.2024 at 16:55

Why not put your server on AWS or any other cloud service?

AWS is good... until it really isn't, by which time you're so deep into their infrastructure that the mere thought of migrating away costs more money than starting a new business from scratch. I'm taking artistic liberty in exaggerating the situation, but to be able to run this site properly in the cloud with manageable costs, it needs to be dissected in ways that make your skin crawl. Just ask corrupt what he's been wading through for the past two years. He's done tremendous improvements to the code, stuff that I wouldn't have believed possible before he took over. The dinosaur is being revived from the brink of extinction before your very eyes.

I.
26.03.2024 - 17:23
corrupt
With a lowercase c
Admin
Written by Ivor on 26.03.2024 at 17:15
He's done tremendous improvements to the code, stuff that I wouldn't have believed possible before he took over. The dinosaur is being revived from the brink of extinction before your very eyes.

Awww

To add to this in a slightly less passive-aggressive way: The site is a HUGE 20-year-old monolith. Everything is intertwined, there is no separation of anything, everything goes into one huge database, and whatever you touch will absolutely break something else. This is the opposite of a project that will scale to AWS.
We're on Docker now and that actually has a lot of advantages over our previous bare-metal setup. But going cloud-native from there will not be viable for a long time.
----
26.03.2024 - 21:53
Guib
Thrash Talker
Written by corrupt on 26.03.2024 at 14:55

So don't be surprised if we actually pursue community donations in the near future.

Yes! Let me donate ffs. Let us help lol. I've even messaged about this a while back and never got a reply. Some of us would love to help out. We might not all have the time or expertise to work on the site or even contribute to articles and such... But at least we can help you guys unlock resources and pay for services if need be.
----
- Headbanging with mostly clogged arteries to that stuff -
Guib's List Of Essential Albums
- Also Thrash Paradise
Thrash Here
26.03.2024 - 22:37
corrupt
With a lowercase c
Admin
Written by Guib on 26.03.2024 at 21:53
Yes! Let me donate ffs. Let us help lol. I've even messaged about this a while back and never got a reply. Some of us would love to help out. We might not all have the time or expertise to work on the site or even contribute to articles and such... But at least we can help you guys unlock resources and pay for services if need be.

We're working on it now. It's not as easy as pasting a link to one of our private PayPal accounts. We need to connect the money we collect as donations with the non-profit entity that owns Metal Storm and make sure the legalities of it are sound. Give us some time, we'll come up with something.
----
26.03.2024 - 23:36
tludmetal
Well, I was not talking about moving to a microservices architecture or leveraging cloud-native features. I understand that you are managing a very complex monolith with thousands of untouchable interdependencies. When I recommend rehosting your platform in the cloud, the point is to leverage at least a few cloud features: backing up your infrastructure to reduce RPO and RTO, and putting a CDN in front to cache requests and take some of the request load off the server. Vertical scaling is another feasible option in the cloud. Additionally, if you believe you could be receiving malicious requests, a WAF or some rate-limiting controller can help. I'm just trying to help, as a metal fan who appreciates the really good job you are doing on this site.
27.03.2024 - 00:09
corrupt
With a lowercase c
Admin
Written by tludmetal on 26.03.2024 at 23:36

Well, I was not talking about moving to a microservices architecture or leveraging cloud-native features. I understand that you are managing a very complex monolith with thousands of untouchable interdependencies. When I recommend rehosting your platform in the cloud, the point is to leverage at least a few cloud features: backing up your infrastructure to reduce RPO and RTO, and putting a CDN in front to cache requests and take some of the request load off the server. Vertical scaling is another feasible option in the cloud. Additionally, if you believe you could be receiving malicious requests, a WAF or some rate-limiting controller can help. I'm just trying to help, as a metal fan who appreciates the really good job you are doing on this site.

It would still explode our budget. As Ivor said, when you want to make AWS (or Azure or GCP) work for you, you'll soon be in a position where you depend on proprietary services to operate, with costs piling up. The beauty of Metal Storm is that everything is in our control and we run exclusively on OSS.
The reason we don't run a WAF is that I don't want to deal with tuning one, and so far it hasn't been an issue. Running a front cache / edge router would mean terminating TLS at that point and giving that provider access to the requests. All of that would be a downgrade in one way or another. We're running MS without any tracking except for the registration Captcha. I would absolutely hate to give that up and run everything through a third-party provider.
Also, this is still a hobby project. We're faaaaar from defining RTOs and RPOs; they don't really apply given that nothing here is business-critical. If the site breaks, it breaks. I'll get to it when I get to it.
----
27.03.2024 - 00:09
Guib
Thrash Talker
Written by corrupt on 26.03.2024 at 22:37

Written by Guib on 26.03.2024 at 21:53
Yes! Let me donate ffs. Let us help lol. I've even messaged about this a while back and never got a reply....

We're working on it now. It's not as easy as pasting a link to one of our private PayPal accounts. We need to connect the money we collect as donations with the non-profit entity that owns Metal Storm and make sure the legalities of it are sound. Give us some time, we'll come up with something.

Oh no, take your time of course. I would never presume to know what it requires and the time you guys need to put it in place. Also, sorry if it sounded a bit pushy; that was not at all the intention. I'm just particularly eager to help out a website that I've been loving and counting on since I was a teen.
----
- Headbanging with mostly clogged arteries to that stuff -
Guib's List Of Essential Albums
- Also Thrash Paradise
Thrash Here
27.03.2024 - 00:18
LeKiwi
High Fist Prog
Written by corrupt on 26.03.2024 at 14:55

We also still have rate limiting in place. As said before, we are open to feedback on that front, but no promises. Right now our main priority is running the site reliably without outage or major slowdowns.

I'm really glad the site is back up and running... I had a few days there contemplating how I'd manage without MS (hint: it wasn't good)

On the rate limiting: I'm hitting it quite a bit, especially when I open 50 tabs from all my updates. Maybe it's only then. That said, I'd rather take a small hit if it keeps the site free from external harm.
27.03.2024 - 07:37
Cynic Metalhead
Ambrish Saxena
Kudos to both of these stalwarts for pulling off amazing work. A couple of DevOps engineers at my company corroborated how fucking hefty the work gets to fix something like this.

Well done!
27.03.2024 - 10:22
Ivor
Staff
Written by tludmetal on 26.03.2024 at 23:36

Well, I was not talking about moving to a microservices architecture or leveraging cloud-native features. I understand that you are managing a very complex monolith with thousands of untouchable interdependencies. When I recommend rehosting your platform in the cloud, the point is to leverage at least a few cloud features: backing up your infrastructure to reduce RPO and RTO, and putting a CDN in front to cache requests and take some of the request load off the server. Vertical scaling is another feasible option in the cloud. Additionally, if you believe you could be receiving malicious requests, a WAF or some rate-limiting controller can help. I'm just trying to help, as a metal fan who appreciates the really good job you are doing on this site.

Don't get me wrong, your concern is much appreciated. I think when we talked about moving MS off our own metal over 5 years ago, we tried to estimate the cost of an AWS-hosted setup. Given that our single heaviest concern at the time was disk-bound I/O caused by the suboptimal database structure, we came up with numbers exceeding what we were paying for our own rack server. In a way, this same thing bit us this time, meaning that we ran into resource throttling issues on a virtual hosted platform. It's not that we don't want to run a state-of-the-art infra setup; it's just that for now, until the old code gets untangled and refactored, it's somewhat unfeasible.

I.
27.03.2024 - 14:32
tludmetal
Completely understandable.
28.03.2024 - 07:41
24emd
Theory Snob
I would gladly donate to this incredible site.
----
"I am too stupid to be human, and I lack common sense." - Proverbs 30:2
"Music? Well, it's just entertainment, folks!" - Devin Townsend

Best 2024 Albums