Recent server timeout errors — anyone else getting them?

How would you test the system if you don’t allow users in? Unless the people working on the problem are able to reproduce it themselves under current conditions, they wouldn’t know what to test for after deleting the records.

I don’t pretend to know any of the technicalities here, so let me just ask - how long would we have to be offline to shut everything down, upgrade the database, and do enough small-scale testing to risk going back online? 24 hours? A week?

If you want to conduct a test by deleting records, it would make more sense to copy the whole database to a test bed and delete the records there. It would be far less risky than deleting the records in place. I don’t know how many times I’ve seen backups fail.

Why would you want to delete records??

What you could try is to run “innotop” to check the DB server(s) performance.

I assume they have already run a lot of programs to check the database out?

Yes, you would have to allow users in for a real test; I left that out. It would be ideal to copy the whole database to another system if one is available for testing, but that may not be possible.

Who knows if it’s happening to some users more than others? Maybe the people working on it know, but we users don’t. Just because a few users are posting regularly to this thread, doesn’t mean they are the only ones getting it. It may be happening to most or all other users too for all we know.

As for creating a test bed –
If the large size of the database, or the large number of users, or the large number of concurrent users is the problem, then any attempt to test with a smaller database or fewer concurrent users will fail to detect the problem.

Copying the entire DB to another test bed and attempting to do testing on that may or may not work – if you can duplicate the failure there, you can then do contrived testing to narrow down the problem. But there’s a good chance that an entire copy of the whole DB on another machine will work just fine, and you won’t learn anything.

Here’s an observation that might be salient:

It looks to me like, when the system stalls, it stalls for everyone at once. Is there any way to test this hypothesis?

My evidence: As suggested above, when I open a thread, I try to open a whole bunch of threads in rapid succession in separate tabs, so if the system later stalls while I post a reply, I will have other threads to keep me entertained for a while.

And, while opening multiple threads like this, I observe the following: If it stalls when opening one thread, it stalls for all of them. This suggests the possibility that other users trying to open threads are also seeing stalls at the same time.

If the stall lasts long enough, all of my stalled tabs will start getting errors (502 and 504). But the server may unhang itself before that happens, and then ALL of the threads I am opening will appear in all those tabs in rapid succession.

So when it stalls, it may be stalling for all thread opens for everyone, and when it unstalls, it seems to unstall at once for all stalled opens in progress.
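
For what it's worth, here is a rough Python sketch of how that hypothesis could be checked if several of us logged our own page loads. Everything in it is made up for illustration: the `Request` record, the 10-second threshold, and the function names are all assumptions, and it has nothing to do with how the board itself logs anything. The idea is to merge overlapping slow requests into "stall windows" and then look for any fast request that completed entirely inside a window, which would count against "it stalls for everyone at once".

```python
from dataclasses import dataclass


@dataclass
class Request:
    client: str      # hypothetical label for whoever made the request
    start: float     # seconds on a clock shared by all observers (assumed)
    duration: float  # seconds until the page loaded or an error came back


def stall_windows(requests, threshold=10.0):
    """Merge overlapping slow requests into (start, end, clients) windows."""
    slow = sorted(
        (r for r in requests if r.duration >= threshold),
        key=lambda r: r.start,
    )
    windows = []
    for r in slow:
        end = r.start + r.duration
        if windows and r.start <= windows[-1][1]:
            # Overlaps the previous window: extend it and record the client.
            s, e, clients = windows[-1]
            windows[-1] = (s, max(e, end), clients | {r.client})
        else:
            windows.append((r.start, end, {r.client}))
    return windows


def fast_inside(requests, windows, threshold=10.0):
    """Fast requests that began and finished inside some stall window."""
    return [
        r for r in requests
        if r.duration < threshold
        and any(s <= r.start and r.start + r.duration <= e for s, e, _ in windows)
    ]
```

If the windows usually contain more than one client and `fast_inside` comes back empty, that is at least consistent with a board-wide stall. It would not prove it, since a handful of volunteers is a small sample and our clocks would need to agree.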

Does this hint at a possible diagnosis? A lock or concurrency problem? Maybe an intermittent network hardware glitch? I almost think I’d be looking for a transient hardware malfunction.

I saw one major backup fail of triple-:smack: proportions, that I hope never to see again.

Story in spoiler tags because it’s not really on-topic for this thread.

[spoiler]I was sys admin at a company running Unix 4.2 BSD on a VAX 11-750 in the ancient year of 1984. We got an upgrade from Mt. Xinu for 4.3 BSD. This upgrade was a total re-design of the file system, requiring a full re-formatting of the disks.

Came the big day to do it: I made full back-ups. Did the full re-format and created the new partitions and file systems. Installed the new system.

Then ran the full restore.

That didn’t work quite right. Turned out, the all-new 4.3 restore program for the all-new 4.3 file system didn’t quite know how to restore 4.2 dumps. Oops. Fine time to discover that. In particular, symbolic links in 4.3 were entirely different from 4.2 symbolic links, and it apparently didn’t occur to anybody at Berkeley or Mt. Xinu to deal with that. So every symbolic link got restored wrong, and our programmers at my company were using them extensively.

Every restored symbolic link was garbage, and fsck gave errors for every one of them, going on for pages and pages. We had to manually delete every one, one at a time. Even after that, I wasn’t convinced that the remaining file system was really clean, but everything that was left seemed to work and fsck was happy.[/spoiler]

As I have said repeatedly in this thread, it happens to me only occasionally. I’d say I’ve seen an actual error only once in the past week. I’ve had SDMB work perfectly for me at times when several people are complaining that it’s unusable.

For several hours now, not a single error or stall. Was something fixed?

I’m noticing that too!

Could be that with a small number of users it works fine.

I noticed it yesterday afternoon too, so it included busy times.

Since we don’t see it displayed at the bottom of the forum home page, I’ll ask: during a typical daily peak 15-minute period at the SDMB:

  • How many members are logged in?
  • How many non-member visitors are here?

If there’s a low ratio of members to visitors (let’s say 100 members, but 1,500 guests), we’re looking at a lot of junk traffic — content scrapers, foreign search engines, spambots, and the like. Cut the junk traffic, and you cut database queries, bandwidth, and CPU cycles. It’s a start. I’ll even give you some CIDR blocks to put in the firewall, robots.txt, and the .htaccess directives — there’s not many.
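
To make the CIDR idea concrete, here is a minimal Python sketch of matching visitor IPs against a block list and estimating what share of traffic it would cut. The blocks below are placeholders taken from the reserved documentation ranges, not real bot networks, and `is_blocked` and `junk_share` are hypothetical helper names. The real filtering would happen in the firewall or .htaccess, not in a script like this.

```python
import ipaddress

# Placeholder blocks (reserved documentation ranges), NOT actual bot networks.
BLOCKED = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip, blocks=BLOCKED):
    """True if the address falls inside any of the listed CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocks)


def junk_share(ips, blocks=BLOCKED):
    """Fraction of the given request IPs that the block list would cut."""
    if not ips:
        return 0.0
    return sum(is_blocked(ip, blocks) for ip in ips) / len(ips)
```

Running `junk_share` over the IPs seen during a peak 15-minute period would give a rough idea of how much load a given block list removes before anyone touches the firewall.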

Next up: GET RID OF VBULLETIN 3! It’s past end-of-life, and the versions of PHP and MySQL it depends on (5.*) are also approaching EOL. Consider XenForo (from the programmers who originally wrote vBulletin 3). It’s faster, more fully featured, supports paid subscriptions out of the box, and is stable as hell. It supports mobile browsers without needing special templates or Tapatalk. It’s also very affordable, and a bunch of boards that are much larger than the SDMB use it now. Invision Power Board is good if you’re looking for something with content management. vBulletin 4 is just a less stable version of vBulletin 3 with a few more features, and vBulletin 5 is sloooooooooow, even on smaller message boards.

FWIW, in recent months, aggressive Chinese bots have been bringing down a lot of sites, including message boards. See this for details and a .htaccess fix.

Because some users are in this thread saying that it happens to them with great frequency, and others are in this thread saying that they’ve seen the errors but experience them only rarely.

IOW, even within this thread … it’s happening at different rates to different users.

Myself? I don’t get a 502/504 daily though I am on the board 30-60 minutes every day. I’ve only seen a handful this entire month :shrug:

I HAVE seen slowness at times – maybe 3 or 4 times per week – where it looks like an error is imminent, but the page loads “just in time” (I guess?).

Same.

Crossing fingers.

Seems to be happening to me a lot less than it’s happening to some other people. I sometimes check this thread and see people complaining of serious problems when I’m not having any.

Repeating in case it’s relevant: Firefox on a Mac, currently 10.13.4; Firefox either whatever’s the latest Mac version or the one just behind (Firefox seems to update every few days; I generally wait a couple days to install updates in case I hear of problems.)

I also usually open a whole bunch of threads at once, in separate tabs but the same window. Usually there’s no problem. When there’s a problem, it’s usually with several threads; and sometimes but not always with all of them – but bear in mind that they’re not all starting to try to load at the exact same moment; I’ll be scrolling down through ‘new posts’ deciding which ones to click on, sometimes two or more in quick succession but sometimes with significant pauses in between. So what seems to be happening is that any thread I try to open at the wrong time hangs and/or gives an error message, but threads I try to open sooner or later may open with no problem; though sometimes the problem exists for long enough that all the tabs give me error messages. That hasn’t happened to me for some days, though.

I’ll try refreshing the pages that produced errors in a bit, and if one of the threads comes up OK, the others will usually do so also. So, for me at least, it’s not a problem with a specific thread; it’s a problem with the board overall. But for me, it’s rarely enough of a problem to be more than a minor nuisance.

In any case, I’m at the far end of a slow internet connection. I’m used to things not loading immediately; I’m still on the World Wide Wait.

DAMN. I just got a 504 error:

**504 Gateway Time-out**

**shield**