Posted: 10 Sep 2012
Categories: node.js, software engineering, taskrabbit
These ideas have now been formally incorporated into the actionHero project. To learn how to launch actionHero in a clustered way, check out the wiki.
While other servers also use SIGWINCH to mean "kill all my workers", it's important to note that this signal is also fired when you resize your terminal window (responsive console design anyone?). Be sure that only daemonized/background processes respond to SIGWINCH!
I was asked recently how to deploy actionHero to production. Initially, my naive answer was to simply suggest the use of forever, but that was only a partial solution. Forever is a great package which acts as a sort of daemon-izer for your projects. It can monitor running apps and restart them, handle stdout/stderr and logging, etc. It's a great monitoring solution, but when you say forever restart myApp you will incur some downtime. I've spent the past few days working on a full solution.
Footnote - This is a *nix (osx/linux/solaris) answer only. I'm fairly sure this kind of thing won't work on windows.
At TaskRabbit (a Ruby/Rails ecosystem) we have put a lot of effort into "properly" implementing Capistrano and Unicorn so that we can have 0-downtime deployment. This is integral to our culture, and allows us to deploy worry-free a number of times each day. This also makes the code-delta in our deployments smaller, and therefore less risky (to say nothing of the value in reducing the time it takes to launch new features). 0-downtime deployments are good.
Ok, so how to make a 0-downtime node deployment? Forever is certainly part of the solution, but the meat of the answer lies in the node.js cluster module (and how awesome node is at being unix-y).
The cluster module allows one node process to spawn children and share open resources with them. This might include file pointers, but in our case, we are going to share open ports and unix sockets. In a nutshell, if you have one worker open port 80, other workers can also listen on port 80. The cluster module will share the load between all available workers.
The cluster module is usually approached as a way to load balance an application (and it's great at that), but it can also be used as a way to hand over an open connection from one worker to another. In this way, we can tell one worker to die off while another is starting. With enough workers running (and some basic callbacks), we can ensure that there is always a worker around to handle any incoming requests.
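To make the port sharing concrete, here is a minimal sketch (not actionHero code). The function name startSharedPortCluster and its parameters are made up for illustration; cluster.isMaster is the name used by node at the time of this post, and newer node versions also expose it as cluster.isPrimary.

```javascript
var cluster = require('cluster');
var http = require('http');

// Hypothetical helper: the master forks `workerCount` children, and every
// child listens on the same `port`. The cluster module distributes incoming
// connections between them.
function startSharedPortCluster(port, workerCount) {
  if (cluster.isMaster) {
    for (var i = 0; i < workerCount; i++) {
      cluster.fork();
    }
  } else {
    http.createServer(function (req, res) {
      // report which worker answered, so the sharing is visible
      res.end('handled by worker ' + process.pid + '\n');
    }).listen(port);
  }
}
```

Calling startSharedPortCluster(8080, 2) at the bottom of a script and running it with node should give you two workers answering on port 8080; repeated requests will report different pids.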
Graceful shutdown relies on some core node magic. Whether you have created an HTTP server or a direct TCP server, the behavior of server.close() is quite graceful by default. Check out the docs and you will see that the server stops accepting new connections but does not kick out any existing ones; when all clients have left, a callback is fired. We will wait for this callback to know that it is safe to shut down our server.
For an HTTP server this is pretty straightforward: no new connections will be allowed in, and any long-lasting connections will have the chance to finish. In our cluster setup, that means that any new connections that come in during this time will be passed to another worker... exactly what we want! (note: it's possible that a connection might not ever finish, but that's out of scope for this discussion)
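As a rough sketch of that callback pattern (closeGracefully is a hypothetical helper name, not part of node or actionHero), you might wrap server.close() like this:

```javascript
var http = require('http');

// Hypothetical helper: stop accepting new connections and call done()
// only once every existing connection has finished.
function closeGracefully(server, done) {
  server.close(function (err) {
    // err is set if the server was not listening in the first place
    done(err);
  });
}
```

A worker could call closeGracefully(server, function(){ process.exit(); }) when it is told to stop, so that it only exits after its in-flight requests are done.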
Raw TCP connections are another matter. The server behaves the same way, but TCP connections never really expire, so if we don't kick out existing connections, the server will never exit. Take a look at this snippet of code from actionHero's socketServer:

    api.socketServer.gracefulShutdown = function(api, next, alreadyShutdown){
      // stop accepting new connections, but only on the first pass
      if(alreadyShutdown == null){ alreadyShutdown = false; }
      if(alreadyShutdown == false){
        api.socketServer.server.close();
        alreadyShutdown = true;
      }
      // say goodbye to every client that is not waiting on a response
      for(var i in api.socketServer.connections){
        var connection = api.socketServer.connections[i];
        if(connection.responsesWaitingCount == 0){
          connection.end("Server going down NOW\r\nBye!\r\n");
        }
      }
      // if any clients remain, check again in 3 seconds
      if(api.socketServer.connections.length != 0){
        api.log("[socket] waiting on shutdown, there are still " + api.socketServer.connections.length + " connected clients waiting on a response");
        setTimeout(function(){
          api.socketServer.gracefulShutdown(api, next, alreadyShutdown);
        }, 3000);
      }else{
        next();
      }
    }

In the part of the TCP server that handles incoming requests, we increment the connection's connection.responsesWaitingCount, and when the action completes and the response is sent to the client, we decrement it. This way we can approximate whether the client is "waiting for a response" or not. It's important to remember that TCP clients can request many actions at the same time (unlike HTTP, where each request can only ever have one response). Note that once a client is deemed fit to disconnect we send a 'goodbye' message. The client is then responsible for reconnecting, and it will come back and connect to another worker.
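The counting itself can be sketched in a few lines. This is an illustration rather than actionHero's actual request handler: trackConnection, runAction and isSafeToDisconnect are hypothetical names, and only responsesWaitingCount comes from the snippet above.

```javascript
// Hypothetical: give a new connection its counter.
function trackConnection(connection) {
  connection.responsesWaitingCount = 0;
  return connection;
}

// Hypothetical: run an action on behalf of a connection. The counter goes up
// when the request arrives and back down once the response is handed over.
function runAction(connection, action, sendResponse) {
  connection.responsesWaitingCount++;
  action(function (response) {
    connection.responsesWaitingCount--;
    sendResponse(response);
  });
}

// A connection with no pending responses can be told goodbye.
function isSafeToDisconnect(connection) {
  return connection.responsesWaitingCount === 0;
}
```

Because one TCP client can have several actions in flight, the counter can climb above 1; the shutdown loop only disconnects the client once it is back at 0.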
WebSockets work the same way as the TCP server does. Once we disconnect each client, they will reconnect to a new worker node, as the old one has stopped taking connections. socket.io's browser code is very well written and will reconnect and retry any commands that have failed. socket.io binds to the http server we talked about earlier, so shutting it down will also disconnect all websocket clients.
Now that we have servers that gracefully shut down, how do we use them?
The reason for gracefully disconnecting each client was that we are not going to restart each server, but rather kill it entirely and create a new one. Creating a new worker ensures that each process will load in any new code and have a fresh environment to work within.
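The cluster master has a few main goals: share open sockets and ports between workers, pass messages to and from workers, and respond to unix signals.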
As mentioned before, open sockets and ports can be shared (for free) by all children in the cluster. Yay node!
The cluster module provides a message passing interface between the master and its workers. You can pass anything that can be JSON.stringified (no pointers). We can use these methods to be aware of when a booted worker is ready to accept connections, and conversely, we can tell a worker to begin its graceful shutdown process (rather than outright killing the process). Take a look at the worker code at the bottom of the article for more details. Note the use of process.send(msg) within the callbacks for actionHero.start() and actionHero.stop().
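Here is a stripped-down sketch of the worker side of that conversation. The helper attachWorkerProtocol and its arguments are hypothetical: proc stands in for process inside a forked worker (anything with on('message') and send()), app is any object with start(callback) and stop(callback), and exit is what to call once the app has stopped.

```javascript
// Hypothetical helper: answer the master's "start" and "stop" messages and
// report progress back with proc.send().
function attachWorkerProtocol(proc, app, exit) {
  proc.on('message', function (msg) {
    if (msg === 'start') {
      proc.send('starting');
      app.start(function () {
        proc.send('started');
      });
    }
    if (msg === 'stop') {
      proc.send('stopping');
      app.stop(function () {
        proc.send('stopped');
        exit();
      });
    }
  });
}
```

In a real worker you would call something like attachWorkerProtocol(process, actionHero, process.exit). Passing the pieces in as arguments just makes the sketch easy to try without forking anything.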
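Unix signals are the classy way to communicate with a running application. You send them with the kill command, and each signal has a common meaning. The master script at the end of this article responds to these:
- INT / TERM: stop all workers, then exit the master
- HUP: restart every worker in place, all at once
- USR2: rolling restart, swapping in new workers one-by-one
- WINCH: stop all workers
- TTIN: add a worker
- TTOU: remove a worker
So if you wanted to tell the master to stop all of its workers (and its pid was 123), you would run kill -s WINCH 123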
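One way to wire signals up while respecting the SIGWINCH warning at the top of this post is sketched below. This is not the master's actual code: SIGNAL_ACTIONS, bindSignals and the action names are made up for illustration, and the isTTY flag is assumed to come from process.stdout.isTTY.

```javascript
// Hypothetical table: which action each signal should trigger.
var SIGNAL_ACTIONS = {
  SIGINT: 'shutdown',
  SIGTERM: 'shutdown',
  SIGUSR2: 'rollingRestart',
  SIGHUP: 'restartAll',
  SIGWINCH: 'stopAllWorkers',
  SIGTTIN: 'addWorker',
  SIGTTOU: 'removeWorker'
};

// Attach handlers to an emitter (normally `process`). SIGWINCH is skipped
// when attached to a terminal, since resizing the window fires it too.
// Returns the list of signals that were bound.
function bindSignals(emitter, handlers, isTTY) {
  var bound = [];
  Object.keys(SIGNAL_ACTIONS).forEach(function (signal) {
    if (signal === 'SIGWINCH' && isTTY) { return; }
    var handler = handlers[SIGNAL_ACTIONS[signal]];
    if (typeof handler !== 'function') { return; }
    emitter.on(signal, handler);
    bound.push(signal);
  });
  return bound;
}
```

In a master process this would be called as bindSignals(process, handlers, Boolean(process.stdout.isTTY)), where handlers is an object with one function per action name.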
USR2 is the most interesting case here. While there are ways to "reload configuration" in a running node.js app (flush the module cache, reload all source modules, etc), it's usually a lot safer just to start up a new app from scratch. I say that we are going to do a "rolling restart" because we literally are going to kill off the first worker, spawn a new one, and repeat. Assuming we have 2 or more workers, this means that there will always be at least one worker around to handle requests. Now this might lead to problems where some workers have an old version of your codebase and some workers have a new version, but usually that is desirable when compared with outright downtime. Oh, and try not to have more workers than you have CPUs!
The main function in charge of these "rolling restarts" is here:

    var reloadAWorker = function(next){
      // start a replacement if we are below the expected worker count
      var count = 0;
      for (var i in cluster.workers){ count++; }
      if(workersExpected > count){
        startAWorker();
      }
      // then ask the next worker in the restart queue to stop
      if(workerRestartArray.length > 0){
        var worker = workerRestartArray.pop();
        worker.send("stop");
      }
    }

    cluster.on('exit', function(worker, code, signal) {
      log("worker " + worker.process.pid + " (#"+worker.id+") has exited");
      setTimeout(reloadAWorker, 1000) // to prevent CPU-splosions if crashing too fast
    });

When we initialize a rolling restart, we add all workers to the workerRestartArray, and then one-by-one they will be dropped. Note that on every worker's exit, we run reloadAWorker(). This also ensures that if a worker died due to an error, we will start another one in its place (workersExpected is modified by TTIN and TTOU). The reason for the timeout is to ensure that if a worker is crashing on boot (perhaps it can't connect to your database), the master isn't restarting workers as fast as possible... as this would probably lock up your machine.
Here is the state of actionHero's worker and cluster master code at the time of this post (the worker script first, then the master). It's likely to keep evolving, so you can always check out the latest version on GitHub.
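To see why 2 or more workers are enough to avoid downtime, here is a toy simulation of the bookkeeping. It is a simplification with made-up names (simulateRollingRestart is not part of actionHero): there are no real processes or timers, and each replacement is assumed to be ready before the next old worker is stopped.

```javascript
// Toy model: stop one old worker at a time, start a replacement, repeat.
// Returns the final worker ids, the lowest number of live workers seen,
// and the order of events.
function simulateRollingRestart(workerIds) {
  var alive = workerIds.slice();
  var restartQueue = workerIds.slice(); // like workerRestartArray
  var nextId = Math.max.apply(null, workerIds) + 1;
  var minAlive = alive.length;
  var events = [];
  while (restartQueue.length > 0) {
    var old = restartQueue.pop();
    alive.splice(alive.indexOf(old), 1); // the old worker exits
    events.push('stop ' + old);
    minAlive = Math.min(minAlive, alive.length);
    alive.push(nextId); // the master notices the gap and forks a new one
    events.push('start ' + nextId);
    nextId++;
  }
  return { alive: alive, minAlive: minAlive, events: events };
}
```

With three workers, simulateRollingRestart([1, 2, 3]) never drops below two live workers and ends with three brand new ones. With a single worker, minAlive would hit 0, which is exactly the downtime we are trying to avoid.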

    #!/usr/bin/env node

    // scripts/actionHero: the worker script that the cluster master forks

    // load in the actionHero class
    var actionHero = require(__dirname + "/../api.js").actionHero; // normally if installed by npm: var actionHero = require("actionHero").actionHero;
    var cluster = require('cluster');

    // holds the running api object once the server has booted
    var api;

    // if there is no config.js file in the application's root, then actionHero will load in a collection of default params. You can overwrite them with params.configChanges
    var params = {};
    params.configChanges = {};

    // any additional functions you might wish to define to be globally accessible can be added as part of params.initFunction. The api object will be available.
    params.initFunction = function(api, next){
      next();
    }

    // start the server!
    var startServer = function(next){
      if(cluster.isWorker){ process.send("starting"); }
      actionHero.start(params, function(api_from_callback){
        api = api_from_callback;
        api.log("Boot Successful @ worker #" + process.pid, "green");
        if(typeof next == "function"){
          if(cluster.isWorker){ process.send("started"); }
          next(api);
        }
      });
    }

    // handle signals from master if running in cluster
    if(cluster.isWorker){
      process.on('message', function(msg) {
        if(msg == "start"){
          process.send("starting");
          startServer(function(){
            process.send("started");
          });
        }
        if(msg == "stop"){
          process.send("stopping");
          actionHero.stop(function(){
            api = null;
            process.send("stopped");
            process.exit();
          });
        }
        if(msg == "restart"){
          process.send("restarting");
          actionHero.restart(function(success, api_from_callback){
            api = api_from_callback;
            process.send("restarted");
          });
        }
      });
    }

    // start the server!
    startServer(function(api){
      api.log("Successfully Booted!", ["green", "bold"]);
    });

    #!/usr/bin/env node

    //////////////////////////////////////////////////////////////////////////////////////////////////////
    //
    // TO START IN CONSOLE: `./scripts/actionHeroCluster`
    // TO DAEMONIZE: `forever start scripts/actionHeroCluster`
    //
    // ** Production-ready actionHero cluster example **
    // - workers which die will be restarted
    // - master/manager specific logging
    // - pidfile for master
    // - USR2 restarts (graceful reload of workers while handling requests)
    // -- Note, socket/websocket clients will be disconnected, but there will always be a worker to handle them
    // -- HTTP, HTTPS, and TCP clients will be allowed to finish the action they are working on before the server goes down
    // - TTOU and TTIN signals to subtract/add workers
    // - WINCH to stop all workers
    // - TCP, HTTP(s), and Web-socket clients will all be shared across the cluster
    // - Can be run as a daemon or in-console
    // -- Lazy Daemon: `nohup ./scripts/actionHeroCluster &`
    // -- you may want to explore `forever` as a daemonizing option
    //
    // * Setting process titles does not work on windows or OSX
    //
    // This example was heavily inspired by Ruby Unicorns [[ http://unicorn.bogomips.org/ ]]
    //
    //////////////////////////////////////////////////////////////////////////////////////////////////////

    //////////////
    // Includes //
    //////////////

    var fs = require('fs');
    var cluster = require('cluster');
    var colors = require('colors');

    var numCPUs = require('os').cpus().length;
    var numWorkers = numCPUs - 2;
    if (numWorkers < 2){ numWorkers = 2; }

    ////////////
    // config //
    ////////////

    var config = {
      // script for workers to run (You probably will be changing this)
      exec: __dirname + "/actionHero",
      workers: numWorkers,
      pidfile: "./cluster_pidfile",
      log: process.cwd() + "/log/cluster.log",
      title: "actionHero-master",
      workerTitlePrefix: " actionHero-worker",
      silent: true // don't pass stdout/err to the master
    };

    /////////
    // Log //
    /////////

    var logHandle = fs.createWriteStream(config.log, {flags:"a"});
    var log = function(msg, col){

      // zero-pad a number to at least two digits
      var padDateDoubleStr = function(i){
        return (i < 10) ? "0" + i : "" + i;
      };

      // format a date as "YYYY-MM-DD HH:MM:SS"
      var sqlDateTime = function(time){
        if(time == null){ time = new Date(); }
        var dateStr =
          padDateDoubleStr(time.getFullYear()) +
          "-" + padDateDoubleStr(1 + time.getMonth()) +
          "-" + padDateDoubleStr(time.getDate()) +
          " " + padDateDoubleStr(time.getHours()) +
          ":" + padDateDoubleStr(time.getMinutes()) +
          ":" + padDateDoubleStr(time.getSeconds());
        return dateStr;
      };

      // write the plain message to the log file, and a colored copy to the console
      msg = sqlDateTime() + " | " + msg;
      logHandle.write(msg + "\r\n");
      if(typeof col == "string"){ col = [col]; }
      for(var i in col){
        msg = colors[col[i]](msg);
      }
      console.log(msg);
    }

    //////////
    // Main //
    //////////

    log(" - STARTING CLUSTER -", ["bold", "green"]);

    // set pidFile
    if(config.pidfile != null){
      fs.writeFileSync(config.pidfile, process.pid.toString());
    }

    process.stdin.resume();
    process.title = config.title;

    // workers waiting to be stopped as part of a rolling restart
    var workerRestartArray = [];

    // how many workers should be running (changed by TTIN and TTOU)
    var workersExpected = 0;

    // signals
    process.on('SIGINT', function(){
      log("Signal: SIGINT");
      workersExpected = 0;
      setupShutdown();
    });

    process.on('SIGTERM', function(){
      log("Signal: SIGTERM");
      workersExpected = 0;
      setupShutdown();
    });

    // note: SIGKILL cannot be caught by a process, so there is no handler for it

    process.on('SIGUSR2', function(){
      log("Signal: SIGUSR2");
      log("swap out new workers one-by-one");
      workerRestartArray = [];
      for(var i in cluster.workers){
        workerRestartArray.push(cluster.workers[i]);
      }
      reloadAWorker();
    });

    process.on('SIGHUP', function(){
      log("Signal: SIGHUP");
      log("reload all workers now");
      for (var i in cluster.workers){
        var worker = cluster.workers[i];
        worker.send("restart");
      }
    });

    process.on('SIGWINCH', function(){
      log("Signal: SIGWINCH");
      log("stop all workers");
      workersExpected = 0;
      for (var i in cluster.workers){
        var worker = cluster.workers[i];
        worker.send("stop");
      }
    });

    process.on('SIGTTIN', function(){
      log("Signal: SIGTTIN");
      log("add a worker");
      workersExpected++;
      startAWorker();
    });

    process.on('SIGTTOU', function(){
      log("Signal: SIGTTOU");
      log("remove a worker");
      workersExpected--;
      // stop only the first worker we find
      for (var i in cluster.workers){
        var worker = cluster.workers[i];
        worker.send("stop");
        break;
      }
    });

    process.on("exit", function(){
      workersExpected = 0;
      log("Bye!");
    });

    // signal helpers
    var startAWorker = function(){
      // `var` keeps each worker local, so the message handler below logs the right pid
      var worker = cluster.fork();
      log("starting worker #" + worker.id);
      worker.on('message', function(message){
        if(worker.state != "none"){
          log("Message ["+worker.process.pid+"]: " + message);
        }
      });
    }

    var setupShutdown = function(){
      log("Cluster manager quitting", "red");
      log("Stopping each worker...");
      for(var i in cluster.workers){
        cluster.workers[i].send('stop');
      }
      setTimeout(loopUntilNoWorkers, 1000);
    }

    var loopUntilNoWorkers = function(){
      // cluster.workers is an object keyed by worker id, so count its keys
      var count = Object.keys(cluster.workers).length;
      if(count > 0){
        log("there are still " + count + " workers...");
        setTimeout(loopUntilNoWorkers, 1000);
      }else{
        log("all workers gone");
        if(config.pidfile != null){
          fs.unlinkSync(config.pidfile);
        }
        process.exit();
      }
    }

    var reloadAWorker = function(next){
      // start a replacement if we are below the expected worker count
      var count = 0;
      for (var i in cluster.workers){ count++; }
      if(workersExpected > count){
        startAWorker();
      }
      // then ask the next worker in the restart queue to stop
      if(workerRestartArray.length > 0){
        var worker = workerRestartArray.pop();
        worker.send("stop");
      }
    }

    // Fork it.
    cluster.setupMaster({
      exec : config.exec,
      args : process.argv.slice(2),
      silent : config.silent
    });

    for (var i = 0; i < config.workers; i++) {
      workersExpected++;
      startAWorker();
    }

    cluster.on('fork', function(worker) {
      log("worker " + worker.process.pid + " (#"+worker.id+") has spawned");
    });

    cluster.on('listening', function(worker, address) {

    });

    cluster.on('exit', function(worker, code, signal) {
      log("worker " + worker.process.pid + " (#"+worker.id+") has exited");
      setTimeout(reloadAWorker, 1000) // to prevent CPU-splosions if crashing too fast
    });

Enjoy!