Weekend Slowness Report

1 timspalding
Edited: Aug 21, 1:02 pm

The weekend's slowness and connection errors are now over, and we understand what caused them. So I wanted to write up an incident report on what happened and what we will be doing to keep it from happening again.

1. We sent out the State of the Thing on Friday afternoon. SOTT goes out to 1.2 million accounts, and although not everyone opens it, that's a lot of emails.
2. The emails contained a number of extremely large images that came directly from the LibraryThing servers. Such images usually come from our "CDN" (content delivery network), which is separate from the main LibraryThing servers and can take an enormous amount of traffic without blinking.
3. The images were also too large as files; we should have used smaller ones.
4. Realities 1-3 caused an enormous (up to 700%) rise in traffic to LibraryThing's "pics" server. The extra traffic led to failures across other LibraryThing services that share the same network "pipe."

The problem was not detected and fixed quickly, because:

1. It was the weekend.
2. Most metrics were not affected. For example, actual LibraryThing pages didn't experience a big rise in generation times; the problems happened outside that system, in the system that receives requests and sends the data back to the user.
3. Despite a lot of alerting on a lot of systems, our alerting has gaps, and for a time this fooled us into thinking the problems were not present or serious, at least until we stopped looking at metrics and tried to use the site. We need to redesign the system to alert on problems with the basic, normal functionality of the site as experienced by members.
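
As a rough illustration of the kind of member's-eye check point 3 describes, here is a minimal Python sketch. It is not LibraryThing's monitoring; the names `probe`, `ProbeResult`, and the `slow_after` threshold are all hypothetical. The idea is to fetch a real page in full, from outside the network so the request crosses the same "pipe" members use, and treat a failure or a slow response as a problem even when page-generation metrics look healthy.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool
    seconds: float
    detail: str


def _http_fetch(url: str, timeout: float) -> int:
    # Read the whole body, not just the headers, so a stalled
    # transfer shows up as slowness or a timeout.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        return resp.status


def probe(url: str, timeout: float = 10.0, slow_after: float = 3.0,
          fetch: Optional[Callable[[str, float], int]] = None) -> ProbeResult:
    """Fetch one page and judge it by what a member would experience.

    `slow_after` is a hypothetical threshold in seconds; `fetch` can be
    swapped out for testing.
    """
    fetch = fetch or _http_fetch
    start = time.monotonic()
    try:
        status = fetch(url, timeout)
    except Exception as exc:  # connection errors, timeouts, HTTP errors
        return ProbeResult(url, False, time.monotonic() - start, f"failed: {exc}")
    elapsed = time.monotonic() - start
    if status != 200:
        return ProbeResult(url, False, elapsed, f"HTTP {status}")
    if elapsed > slow_after:
        return ProbeResult(url, False, elapsed, f"slow: {elapsed:.1f}s")
    return ProbeResult(url, True, elapsed, "ok")
```

A scheduler would call `probe` every minute or so against a few ordinary pages and an image URL, and page someone when `ok` is false several times in a row. Where the probe runs matters more than the code: run from inside the same network, it would not have seen this weekend's problem.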

We're fixing this as follows:

1. Future SOTTs will be made with safe CDN resources. This one wasn't, because developers and systems had been playing with our CDN options, and didn't realize how the URLs were going to be used, or just how bad using a lot of non-CDN images might get.
2. Future SOTTs will use smaller images.
3. We will be revisiting our monitoring and alerting to never miss the easy stuff—breakage and slowness in the basic, ordinary functionality of the site. This is not always the easiest thing to monitor automatically.
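
Fixes 1 and 2 lend themselves to an automated check before a newsletter goes out. The following Python sketch is only an illustration under stated assumptions: `CDN_HOSTS` and `MAX_IMAGE_BYTES` are made-up placeholder values, not LibraryThing's real hostnames or limits, and the function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

CDN_HOSTS = {"cdn.example.com"}  # hypothetical CDN hostname
MAX_IMAGE_BYTES = 200_000        # hypothetical per-image size budget


class _ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def non_cdn_images(email_html, cdn_hosts=CDN_HOSTS):
    """Return image URLs in the email that are not served from a CDN host.

    Relative URLs have no hostname, so they are flagged as well.
    """
    collector = _ImgCollector()
    collector.feed(email_html)
    return [s for s in collector.srcs if urlparse(s).hostname not in cdn_hosts]


def oversized_images(sizes, max_bytes=MAX_IMAGE_BYTES):
    """Given a mapping of image URL to size in bytes, return URLs over budget."""
    return [url for url, size in sizes.items() if size > max_bytes]
```

If either function returns a non-empty list, the send would be held until someone looks at it. How the image sizes are gathered (for example from the files on disk or from HTTP `Content-Length` headers) is left open here.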

Thank you all for your patience this weekend. We'll try to do better.

2 bnielsen
Aug 21, 1:21 pm

>1 timspalding: Thanks for telling us what the problem was. I for one have a day job where this sort of information is good to have. Even though we have a lot of war stories of our own :-)

3 DuncanHill
Aug 21, 1:31 pm

"thinking the problems were not present or serious, at least until we stopped looking at metrics and tried to use the site"

There are so many websites where it's obvious the operators never try to use the site. But the metrics are great!

4 waltzmn
Aug 21, 2:20 pm

>2 bnielsen: Thanks for telling us what the problem was.

I second this. I don't even have to deal with this very much, but I want to know what's going on. So, thank you for the report!

5 Keeline
Aug 21, 3:40 pm

That probably accounts for my attempt to view cover images for the Windermere series collection yesterday afternoon (PDT) and seeing dial-up speed loads of the images. After a reload didn't help, I disconnected and reconnected my VPN and it seemed to be better. (For nearly any Internet problem, my wife suspects the VPN.)

In my day job I manage up to 42 Linux servers and half a dozen or more MySQL servers. The systems are so diverse that it can be hard to monitor everything and resolve problems as they come up. It's not exactly the same as what you work with but I can relate and empathize.

James

6 laytonwoman3rd
Aug 21, 4:26 pm

Glad we're back up to speed. And I second (or third, or whateveritis now) the THANKS for letting us know what happened and what's going to be done to prevent it from happening again!

7 2wonderY
Aug 21, 4:44 pm

Are there any other public membership sites that bother to converse with the members? Tim, you continue to amaze me.

8 LibraryCin
Aug 21, 11:14 pm

Like others, I thank you for the update and explanation.

9 Nicole_VanK
Aug 22, 1:36 am

Thank you Tim. (A very understandable mishap.)

10 susanbooks
Aug 22, 2:52 am

This is one of the many reasons we love the site. Thank you, Tim.