Strange behaviour in Parse server

After adding maxPoolSize=1000 to the DB connection string in the parse-server config file, it has started generating more logs. At the same time, the node --max-old-space-size setting allows parse-server to run for longer.
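
For reference, a minimal sketch of where these two settings live (the URI, keys and values below are placeholders, not our actual config):

```js
// Placeholder Parse Server config: maxPoolSize is appended to the MongoDB
// connection string that Parse Server uses.
const config = {
  databaseURI: 'mongodb://db-host:27017/myapp?maxPoolSize=1000',
  appId: 'APP_ID',
  masterKey: 'MASTER_KEY',
  serverURL: 'http://localhost:1337/parse',
};

// The heap limit is set when launching the process, e.g.:
//   node --max-old-space-size=1500 index.js
module.exports = config;
```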

It seems to me as if it just takes longer until the issues occur.

By increasing the connection pool size you open and maintain more connections to the DB. That should only make a difference in performance (for better or worse), not in stability, unless you have such a high variance in query execution times that long-running queries delay waiting queries so severely that they time out. That may be something you want to look into.
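
To illustrate the pool behaviour, here is a sketch against the MongoDB Node.js driver directly (the names and values are hypothetical, and this is not Parse Server's internal code):

```js
// Illustration: a small pool plus slow queries can lead to wait-queue
// timeouts at the driver level.
const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://db-host:27017', {
    maxPoolSize: 5,            // only 5 concurrent sockets to the DB
    waitQueueTimeoutMS: 5000,  // a checkout waiting longer than 5s fails
  });
  await client.connect();
  const coll = client.db('myapp').collection('Item');

  // If a handful of slow queries occupy the whole pool, the remaining
  // operations sit in the wait queue and can eventually time out.
  await Promise.all(
    Array.from({ length: 50 }, () => coll.find({}).toArray())
  );
  await client.close();
}

main().catch(console.error);
```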

By increasing the old space size, you allow the node process to use more RAM for its heap, so unless you see in the logs that the server / node process actually crashed, it shouldn’t make any difference in stability either.

Is it possible that the high volume of queries was causing the timeout?

And yes, I added --max-old-space-size due to node actually crashing with OOM.

Is it possible that the high volume of queries was causing the timeout?

It’s possible; you’d see that in the logs.

And yes, I added --max-old-space-size due to node actually crashing with OOM.

I think if you had revealed that earlier, it would have made identifying the issue easier. You want to make sure no OOMs are happening anymore. If there is in fact a memory leak, increasing the allocated RAM may just delay the crash, because it takes more time for the leak to use up the available memory. So I suggest you look closely at whether you still see OOMs, possibly just at a later point in time, and identify whether there is a leak.
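
A minimal sketch of one way to watch for a leak, assuming you can add a few lines to the server entry point (the interval and log format are arbitrary):

```js
// Periodically log heap usage so a steady climb (a likely leak) shows up in
// the logs before the process hits the --max-old-space-size limit.
setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const mb = (n) => Math.round(n / 1024 / 1024);
  console.log(
    `memory rss=${mb(rss)}MB heapUsed=${mb(heapUsed)}MB heapTotal=${mb(heapTotal)}MB`
  );
}, 60 * 1000);
```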

The OOMs started occurring after adding maxPoolSize.

They don’t happen anymore.

However, at peak load times parse-server still crashes, despite having 5 t3a.small (2 vCPUs, 2 GB RAM) instances running concurrently.

Currently parse-server doesn’t throw any errors, however Nginx keeps throwing:
upstream timed out (110: Connection timed out) while connecting to upstream,

And if, while the 110 errors are occurring, I close parse-server with Ctrl+C, I get about 150 logs of:

```
"name":"MongoError","level":"error","message":"Uncaught internal server error. server is closed","stack":"MongoError: server is closed\n    at Server.query (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/core/sdam/server.js:299:16)\n    at FindOperation.execute (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/operations/find.js:30:12)\n    at Object.callback (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/operations/execute_operation.js:151:17)\n    at processWaitQueue (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/core/sdam/topology.js:1047:21)\n    at NativeTopology.selectServer (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/core/sdam/topology.js:448:5)\n    at executeWithServerSelection (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/operations/execute_operation.js:137:12)\n    at executeOperation (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/operations/execute_operation.js:75:7)\n    at Cursor._initializeCursor (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/core/cursor.js:531:7)\n    at Cursor._initializeCursor (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/cursor.js:184:11)\n    at nextFunction (/usr/lib/node_modules/parse-server/node_modules/mongodb/lib/core/cursor.js:734:10)","timestamp":"2021-05-17T11:06:24.725Z"
```

You could look at the DB’s query execution stats; there may be queries that time out. You can configure the MongoDB Node.js driver so that the MongoDB server process cancels any query that takes longer than n milliseconds. Take a look at databaseOptions: maxTimeMS in the Parse Server Options. You could try setting that to a value like 30s (note that the value has to be set in ms), if that is also the server timeout for requests.

See docs.
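
For example, a sketch of what that could look like in the Parse Server options (all other values here are placeholders):

```js
// Sketch: passing maxTimeMS through databaseOptions so the MongoDB server
// cancels queries running longer than the given limit.
const { ParseServer } = require('parse-server');

const server = new ParseServer({
  databaseURI: 'mongodb://db-host:27017/myapp?maxPoolSize=1000',
  appId: 'APP_ID',
  masterKey: 'MASTER_KEY',
  serverURL: 'http://localhost:1337/parse',
  databaseOptions: {
    maxTimeMS: 30 * 1000, // 30s, must be given in milliseconds
  },
});
```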

Hey Manuel,

It still didn’t work with databaseOptions: maxTimeMS.

I tried increasing the number of hosts, and that increased the number of logs to an extent, but only because of the number of instances; like you said earlier, the maxPoolSize and node settings seem completely unrelated.

My last suspicion is that parse-server has some kind of per-second query limit.
However, I went through the docs and it seems that isn’t the case either.

Any other ideas as to what is causing this crash?
Things I’ve tried -

  • increased ulimit for node
  • Nginx worker_rlimit_nofile set to 100000
  • Nginx worker_connections set to 50000
  • horizontal and vertical scaling of instances
  • httpServer.timeout as suggested by you -

```js
httpServer.timeout = 60 * 1000;
httpServer.keepAliveTimeout = 70 * 1000;
httpServer.headersTimeout = 120 * 1000;
```
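
For context, a sketch of where those timeouts are applied, assuming the usual setup of Parse Server mounted on an Express app (the config details are omitted):

```js
// Sketch: Parse Server mounted on Express, with the HTTP server timeouts
// from above applied to the underlying Node http server.
const express = require('express');
const http = require('http');
const { ParseServer } = require('parse-server');

const app = express();
const api = new ParseServer({ /* databaseURI, appId, masterKey, serverURL, ... */ });
app.use('/parse', api);

const httpServer = http.createServer(app);
httpServer.timeout = 60 * 1000;
httpServer.keepAliveTimeout = 70 * 1000;
httpServer.headersTimeout = 120 * 1000;
httpServer.listen(1337);
```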

We’re currently at 7 t3a.small instances, but there’s still no clear root cause.

For additional diagnostics I even tried db.currentOp, and the number of active operations doesn’t go above 3, which means that all queries get served almost immediately.
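
For reference, this is the kind of check meant here, run in the mongo shell (the 3-second threshold is just an example):

```js
// Run in the mongo shell: list operations that have been active for more
// than 3 seconds.
db.currentOp({
  active: true,
  secs_running: { $gt: 3 }
});
```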

Any other ideas?

P.S. Increased the number of instances because I noticed that at peak times (when the server starts crashing) we get about 688k requests in total, which means about 137,600 per instance.

How many queries did time out when you set maxTimeMS?
Maybe this could be a bottleneck at the MongoDB server.
Maybe it could be some other network component that times out, depending on your infrastructure.
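
One way to check for a DB-side bottleneck, as a sketch run in the mongo shell (the chosen fields are only a starting point):

```js
// Run in the mongo shell: a few serverStatus() fields that hint at a DB-side
// bottleneck (connection usage, op counters, queued readers/writers).
var s = db.serverStatus();
printjson({
  connections: s.connections,
  opcounters: s.opcounters,
  globalLockQueue: s.globalLock ? s.globalLock.currentQueue : null
});
```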

Here are some approaches I would suggest going forward:

  • Look at the logs, find errors and work your way up. I mean all the logs: server instance metrics, nginx metrics, node metrics (try transaction analysis using a tool like New Relic), DB server logs, MongoDB logs, load balancer logs, node logs.
  • Try to replicate the issue in a controlled manner and focus on which components behave oddly (see the sketch after this list).
  • Eliminate as many components as possible and see whether the issue still occurs, then gradually add components back. Maybe even try a DBaaS like MongoDB Atlas instead of a self-hosted DB and see whether that changes anything, which may point to a DB issue.
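
One way to do the controlled replication mentioned above, as a sketch; the autocannon package, the endpoint and all values here are my own choice, not something from this thread:

```js
// Sketch: generate controlled load against a staging Parse Server endpoint
// and inspect errors, timeouts and tail latency afterwards.
const autocannon = require('autocannon');

autocannon(
  {
    url: 'http://staging-host:1337/parse/classes/Item', // hypothetical endpoint
    connections: 200, // concurrent connections
    duration: 60,     // seconds
    headers: {
      'X-Parse-Application-Id': 'APP_ID',
      'Content-Type': 'application/json',
    },
  },
  (err, result) => {
    if (err) throw err;
    console.log('non-2xx responses:', result.non2xx);
    console.log('timeouts:', result.timeouts);
    console.log('p99 latency (ms):', result.latency.p99);
  }
);
```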

That is about as much as I can suggest from a distance. We have already tried the obvious. There are just too many variables at this point, which makes it hard to focus on anything specific.

Hi Manuel,

It’s really weird, but since we last corresponded the system has been working flawlessly.
The last change I made was to increase the file descriptor limits for both Node and Nginx.
If the requests per host stay below about 100k, it seems to stay healthy.
Just wondering whether this is a limitation of Node, Parse Server or Nginx?

FYI, we’re hitting around 1.5-2.4 million hits within half an hour.
Nginx and node have their file descriptor limits increased to 100K, and we have 7 t3a.small instances running with node --max-old-space-size set to 1500.

Whether you hit the file descriptor limits or not can easily be verified in the nginx or system metrics. Nginx, for example, has a dedicated error message in the logs for this. Also, the max file descriptors per process are limited at the instance OS level; that’s another metric to look at.
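
A sketch of how to check this at the OS level from inside the node process, assuming a Linux instance (the /proc paths are Linux-specific):

```js
// Compare this process's open file descriptors against its "Max open files"
// limit by reading /proc (Linux only).
const fs = require('fs');

const openFds = fs.readdirSync('/proc/self/fd').length;
const limitsLine = fs
  .readFileSync('/proc/self/limits', 'utf8')
  .split('\n')
  .find((line) => line.startsWith('Max open files'));

console.log('open file descriptors:', openFds);
console.log(limitsLine); // e.g. "Max open files  100000  100000  files"
```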

Marking my last answer as the solution.

@Manuel, thanks a ton for the continuous and quick responses and help!
