Zero-downtime deployments with unicorn and supervisord

written by Raphael

In this post I'll discuss zero-downtime deployments using unicorn and supervisord. There's a lot more to zero-downtime deployments than just keeping your website available. Listen to Ruby Rogues Ep. 71 or search Google for a broader discussion of the problems involved.


When running a web application in production you should strive for 100% reachability. Downtime is normally perceived as an error in your application, and rightfully so. If you deploy often, your users might stop using your app because of the 502 errors they encounter.

Since I like to use supervisord in my production setup, the most widely used unicorn setup for zero-downtime deployments does not work out of the box. Supervisord requires the processes it manages to stay in the foreground, so unicorn must not daemonize. Also, a SIGUSR2 restart replaces the unicorn master: a new master is started and the old one exits. Since supervisord watches the old master, it will consider the application exited even though it is running under a new process ID. Finally, supervisord will try to restart the application and fail because all sockets are in use by the new unicorn master.

Luckily, there's a utility called unicornherder. Unicornherder does not daemonize itself and keeps an eye on the unicorn pid file to check if unicorn is still alive. All signals sent to unicornherder are forwarded to the unicorn process. If unicorn quits, unicornherder quits too.
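To make the "keeps an eye on the pid file" idea concrete, here is a minimal Ruby sketch of such a liveness check. It is not unicornherder's actual code (unicornherder is written in Python), and the helper names read_pid and process_alive? are made up for illustration.

```ruby
# Minimal sketch of a pid-file liveness check (not unicornherder's code).

# Returns the PID stored in pid_path, or nil if the file is missing or empty.
def read_pid(pid_path)
  pid = File.read(pid_path).to_i
  pid > 0 ? pid : nil
rescue Errno::ENOENT
  nil
end

# Returns true if a process with the given PID exists.
# Signal 0 only performs the existence and permission check;
# no signal is actually delivered.
def process_alive?(pid)
  return false if pid.nil?
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end
```

A watcher built on this would poll process_alive?(read_pid(path)) at an interval and exit once it returns false, which is what lets supervisord notice that unicorn is gone.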

So, in order to use SIGUSR2 and preload_app for zero-downtime deployments we need to install unicornherder.


# assuming you are running Ubuntu:
$ sudo apt-get install python-dev
$ pip install unicornherder
$ which unicornherder # => /usr/local/bin/unicornherder

Unicornherder itself does not require an additional configuration file. All required arguments are passed on the command line.

Next we need to configure supervisord:

Supervisord

Supervisord watches unicornherder, and unicornherder starts unicorn as a daemon. So all we need to do is to properly start unicornherder and make sure it keeps running.

Here's a sample supervisord configuration file I generated using foreman export:


[program:myapp-unicornherder-1]
command=/home/webapp/.rvm/bin/app_bundle exec unicornherder -u unicorn -p tmp/pids/unicorn.pid -- -c config/unicorn.rb
autostart=true
autorestart=true
stopsignal=QUIT
stdout_logfile=/home/webapp/shared/log/unicornherder-1.log
stderr_logfile=/home/webapp/shared/log/unicornherder-1.error.log
user=webapp
directory=/home/webapp/current
environment=RAILS_ENV="production",APP_PATH="/home/webapp/current",SHARED_PATH="/home/webapp/shared",TEMP_PATH="/home/webapp/shared/tmp",PORT="8619"

[group:myapp]
programs=myapp-unicornherder-1

The details:

  • unicornherder is passed the path to the unicorn pidfile using the -p flag; everything after -- is handed on to unicorn.
  • supervisord will send the QUIT signal to unicornherder if we want to stop unicorn.
  • unicorn is executed in an RVM-managed environment, and I'm using an RVM wrapper to load the correct Ruby version and gemset.
  • basic unicorn configuration settings are exported into the environment.

Unicorn

The unicorn configuration follows:


worker_processes ((ENV['RAILS_ENV'] == 'development') ? 2 : 8)

working_directory ENV["APP_PATH"]

listen ENV["PORT"].to_i, :tcp_nopush => true

timeout 30

pid (ENV["TEMP_PATH"] + "/pids/unicorn.pid")

stderr_path ENV["SHARED_PATH"] + "/log/unicorn.stderr.log"
stdout_path ENV["SHARED_PATH"] + "/log/unicorn.stdout.log"

preload_app true

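before_fork do |server, worker|
  # The master has no use for a database connection once the app is preloaded.
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
  end

  # After SIGUSR2 the old master renames its pid file to *.oldbin.
  # Note that server.pid is the pid file path, not a process id.
  old_pid = ENV["TEMP_PATH"] + '/pids/unicorn.pid.oldbin'
  if File.exist?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # someone else did our job for us
    end
  end
end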

after_fork do |server, worker|
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.establish_connection
  end
end

The important point here is that we close any connections to external resources, as the master has no use for them. Also note that we kill the old master as soon as the preloading is done.
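One detail worth guarding against: String#to_i returns 0 for an empty or garbled pid file, and Process.kill with a PID of 0 signals the whole process group. The following is a sketch of a more defensive lookup, assuming the same pid file layout as the configuration above. The helper name old_master_pid is hypothetical and not part of unicorn.

```ruby
# Sketch of a defensive lookup for the old master's PID.
# `old_master_pid` is a hypothetical helper, not part of unicorn.
def old_master_pid(pid_path)
  oldbin = "#{pid_path}.oldbin"
  return nil unless File.exist?(oldbin)

  pid = File.read(oldbin).to_i
  # to_i yields 0 for empty or non-numeric content; never pass that to kill.
  pid > 0 ? pid : nil
rescue Errno::ENOENT
  nil # the file disappeared between the existence check and the read
end

# Possible use inside before_fork:
#
#   if (pid = old_master_pid(ENV["TEMP_PATH"] + "/pids/unicorn.pid"))
#     begin
#       Process.kill("QUIT", pid)
#     rescue Errno::ESRCH
#       # the old master is already gone
#     end
#   end
```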

Mina

If we deploy using Mina, we can use the following configuration to perform a zero-downtime deploy:


desc "Deploys the current version to the server."
task :deploy => :environment do
  deploy do
    # omitted

    to :launch do
      queue %[kill -s USR2 $(sudo supervisorctl status | grep unicornherder | cut -d' ' -f7 | cut -d',' -f1)]
    end
  end
end

Starting and stopping unicorn is handled by supervisord:


desc "stop the application"
task :down do
  queue "sudo supervisorctl stop myapp:*"
end

desc "start the application"
task :up do
  queue "sudo supervisorctl start myapp:*"
end
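The launch step extracts the unicornherder PID with grep and cut, which depends on the exact number of spaces in the supervisorctl status output. As a sketch of a less position-dependent approach, the same extraction can be written in Ruby. The function name supervisor_pid and the sample status text are illustrative assumptions, modelled on the usual "RUNNING pid N, uptime ..." line format.

```ruby
# Extracts the PID of the first RUNNING program whose name contains
# `name_fragment` from `supervisorctl status` output. Returns nil if none.
def supervisor_pid(status_output, name_fragment)
  status_output.each_line do |line|
    # Split on runs of whitespace into name, state, and the remaining text.
    name, state, rest = line.split(' ', 3)
    next unless name && name.include?(name_fragment) && state == 'RUNNING'

    match = rest.to_s.match(/\bpid (\d+)/)
    return match[1].to_i if match
  end
  nil
end

# Illustrative sample, not captured from a real server.
sample = <<~STATUS
  myapp:myapp-unicornherder-1   RUNNING   pid 4242, uptime 0:12:03
  myapp:myapp-worker-1          STOPPED   Not started
STATUS

puts supervisor_pid(sample, 'unicornherder')
```

Because the state column is checked explicitly, a stopped or crashed unicornherder yields nil instead of an empty argument to kill.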

Verify we got a zero-downtime deployment

Now it's time to verify our setup is actually working.

Running ab -c 2 -n 100 http://www.example.com/ while restarting our application should not result in ANY dropped connections. Note that this largely depends on how long your application needs to start up. We could further amplify the effects by adding fake calls to sleep in our application.rb.

Anyway, here it goes:

With restarts

This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking example.com (be patient).....done


Server Software:        nginx/1.2.4
Server Hostname:        example.com
Server Port:            80

Document Path:          /
Document Length:        22527 bytes

Concurrency Level:      2
Time taken for tests:   10.947 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      2319600 bytes
HTML transferred:       2252700 bytes
Requests per second:    9.13 [#/sec] (mean)
Time per request:       218.949 [ms] (mean)
Time per request:       109.475 [ms] (mean, across all concurrent requests)
Transfer rate:          206.92 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       55   58   1.8     57      69
Processing:   137  160  27.4    145     263
Waiting:       69   82  21.0     74     148
Total:        193  218  27.4    204     320

Percentage of the requests served within a certain time (ms)
  50%    204
  66%    215
  75%    242
  80%    249
  90%    265
  95%    271
  98%    274
  99%    320
 100%    320 (longest request)

Without restarts

This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking example.com (be patient).....done


Server Software:        nginx/1.2.4
Server Hostname:        example.com
Server Port:            80

Document Path:          /
Document Length:        22527 bytes

Concurrency Level:      2
Time taken for tests:   10.584 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      2319600 bytes
HTML transferred:       2252700 bytes
Requests per second:    9.45 [#/sec] (mean)
Time per request:       211.686 [ms] (mean)
Time per request:       105.843 [ms] (mean, across all concurrent requests)
Transfer rate:          214.02 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       55   58   1.5     58      65
Processing:   137  153  18.3    145     207
Waiting:       68   76   5.8     75     102
Total:        195  211  18.4    202     265

Percentage of the requests served within a certain time (ms)
  50%    202
  66%    204
  75%    215
  80%    219
  90%    248
  95%    251
  98%    252
  99%    265
 100%    265 (longest request)

No failed requests. It works! And the response times with multiple restarts are only slightly worse. Great!
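If you would rather run this check from a script than read ab output by hand, the same idea can be sketched in Ruby with Net::HTTP from the standard library. The helper names http_ok? and failed_requests are made up for this example, and the URL is the same placeholder used above.

```ruby
require 'net/http'
require 'uri'

# Returns true if the URL answers with a 2xx or 3xx status.
def http_ok?(url)
  response = Net::HTTP.get_response(URI(url))
  response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPRedirection)
rescue SystemCallError, SocketError, Timeout::Error, EOFError
  false # connection refused or reset, DNS failure, or timeout
end

# Calls the block `count` times, pausing between calls, and returns
# how many calls reported a failure (a falsy block result).
def failed_requests(count, pause: 0.1)
  failures = 0
  count.times do
    failures += 1 unless yield
    sleep pause
  end
  failures
end

# Usage, while triggering a deploy in another terminal:
#   failed_requests(100) { http_ok?('http://www.example.com/') }
```

A result of 0 corresponds to the "Failed requests: 0" line in the ab reports. Unlike ab, this sketch issues requests sequentially, so it exercises the restart window less aggressively.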

I hope this blog post helped clarify how to use unicorn and supervisord together so that your app server keeps serving requests during a deployment.

Wrapping up:

  • unicorn requires unicornherder for zero-downtime deployments if you are using supervisord
  • unicorn spawns a second master when sent SIGUSR2, which means that during restarts you'll be running twice as many workers as you specified

That's it! Happy hacking!


Setup HAProxy with Nginx on OS X

written by Raphael

This is a follow-up to one of my earlier posts about Replacing Pow.

I've been working a lot with Faye and PrivatePub lately, which caused me to stumble across some problems with my setup, primarily when moving from the development to the production environment.

The main problem is that WebSocket support will only come to Nginx as of 1.3.x, which is many months away.

A common solution is to use HAProxy in front of Nginx. HAProxy can handle WebSocket upgrades and messages properly. Regular HTTP requests are proxied to Nginx.
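The routing decision that the frontend section of the haproxy.cfg in this post encodes can be sketched in Ruby. This is only an illustration of the rule order, not something HAProxy runs. The function name choose_backend is invented, and headers are simplified to a Hash with exact key spelling.

```ruby
# Mirrors the frontend rules: use_backend lines are checked in order,
# the first match wins, and default_backend applies otherwise.
def choose_backend(headers, url)
  upgrade = headers.fetch('Upgrade', '')
  host    = headers.fetch('Host', '')

  # Two acl lines sharing a name are ORed together by HAProxy.
  # Both use -i, so the comparison ignores case.
  is_websocket = upgrade.casecmp?('WebSocket') || host.downcase.start_with?('ws')

  # url_sub is a plain, case-sensitive substring match on the URL.
  is_faye = url.include?('faye')

  return :ws_backend   if is_websocket
  return :faye_backend if is_faye
  :nginx_backend
end

p choose_backend({ 'Upgrade' => 'websocket', 'Host' => 'app.dev' }, '/faye')
p choose_backend({ 'Host' => 'app.dev' }, '/faye?message=1')
p choose_backend({ 'Host' => 'app.dev' }, '/posts')
```

Note that a WebSocket upgrade on a Faye URL goes to ws_backend, because that rule is listed first.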

The new configuration of my development environment looks like this:

  • Nginx is listening on port 8080 and running as a user service. Previously, Nginx was listening on port 80 and running as a system service.
  • HAProxy is listening on port 80 and running as a system service

Faye is started on a per project basis.

Installation instructions


$ brew install haproxy
$ mkdir -p $(brew --prefix)/etc/haproxy
$ cat >$(brew --prefix)/etc/haproxy/haproxy.cfg <<EOL
  global
    log 127.0.0.1   local0
    log 127.0.0.1   local1 debug
    #log loghost    local0 info
    maxconn 4096
    #chroot /usr/share/haproxy
    #daemon
    #debug
    #quiet

  defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout      5000
    clitimeout      50000
    srvtimeout      50000

  frontend all 0.0.0.0:80
    timeout client 86400000
    default_backend nginx_backend

    acl is_websocket hdr(Upgrade) -i WebSocket
    acl is_websocket hdr_beg(Host) -i ws
    acl is_faye url_sub faye

    use_backend ws_backend if is_websocket
    use_backend faye_backend if is_faye

  backend ws_backend
    option forwardfor
    option http-server-close
    option http-pretend-keepalive
    timeout queue 5000
    timeout connect 86400000
    timeout server 86400000
    server server1 127.0.0.1:9292 maxconn 2000 check

  backend faye_backend
    option forwardfor
    option http-server-close
    timeout connect 4000
    timeout server 30000
    server server1 127.0.0.1:9292 maxconn 2000 check

  backend nginx_backend
    option forwardfor
    option http-server-close
    timeout connect 1000
    timeout server 6000
    server server1 127.0.0.1:8080 maxconn 2000 check
EOL
$ sudo -i
$ cat >/System/Library/LaunchDaemons/org.homebrew.haproxy.plist <<EOL
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>KeepAlive</key>
  <true/>
  <key>Label</key>
  <string>org.homebrew.haproxy</string>
  <key>NetworkState</key>
  <true/>
  <key>Program</key>
  <string>$(which haproxy)</string>
  <key>ProgramArguments</key>
  <array>
    <string>haproxy</string>
    <string>-f</string>
    <string>$(brew --prefix)/etc/haproxy/haproxy.cfg</string>
  </array>
  <key>StandardErrorPath</key>
  <string>/var/log/system.log</string>
</dict>
</plist>
EOL
$ launchctl load -w /System/Library/LaunchDaemons/org.homebrew.haproxy.plist
$ exit

The above configuration makes a few assumptions:

  • You are running on OS X
  • You are using Homebrew as a package manager
  • You have Nginx running and configured to listen on port 8080

Getting the above to work can be quite troublesome if you had Nginx running as a system service listening on port 80. Console is your best friend in most debugging cases.
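When debugging, a quick first step is to check which of the expected ports actually accept connections. Here is a small sketch using Ruby's standard socket library. The helper name port_open? is made up, and the ports in the usage comment are the ones from the configuration above.

```ruby
require 'socket'

# Returns true if something accepts TCP connections on host:port.
def port_open?(host, port, timeout: 1)
  # Socket.tcp closes the socket when the block returns.
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  false # refused, timed out, or the host could not be resolved
end

# Usage:
#   port_open?('127.0.0.1', 80)    # HAProxy
#   port_open?('127.0.0.1', 8080)  # Nginx
#   port_open?('127.0.0.1', 9292)  # Faye
```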

Gains

Now your development environment more closely resembles your production environment, so you'll encounter most production-related problems earlier. Which is a good thing!

Happy hacking!


Don't write your own ORM-abstraction - Flexirails revised

written by Raphael

Three years ago I started writing a Rails plugin that could create customizable table views which reload via Ajax. The plugin was eventually used for the ICS Festivalmanager, which has been used to help organize the Wacken Open Air Festival since 2010.

Having rewritten the plugin from the ground up, I want to share some bad design decisions I made while writing the first version of Flexirails.

1. Don't couple your plugin to a persistence layer

Flexirails was supposed to represent parts of our domain objects in a customizable way. At first I tightly coupled my presentation logic to the persistence layer (ActiveRecord) by adding all kinds of search logic right into Flexirails.

It took me some time to realize that this was a bad design decision. Here are some things I added, along with the reasons why each was a bad idea:

  • support for different persistence layers and different versions of the same persistence layer. E.g. supporting ActiveRecord and DataMapper, or different versions of ActiveRecord. Sometimes it's better to leave these decisions to the application developers so they can use whatever persistence layer they already have, e.g. the file system, MongoDB, whatever.

  • support for all search queries I could think of. E.g. supporting all comparison operators, similarity, MySQL null search etc. While this might seem good at first, it also means that you'll need to implement all queries for a different persistence layer as well, even though it might not support all of them or you might only need a very small subset for your application.

  • support for combinational logic search queries. E.g. (x AND y) or (x OR y). Again this might seem good at first sight, but it's hard to build a good generic UI for this. A problem domain specific search form will outperform your generic approach many many times in terms of user experience.
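To illustrate what decoupling from the persistence layer can look like, here is a small sketch in which the table view only talks to an object that can hand over a page of records. All names (TableView, ArraySource, fetch_rows, Band) are hypothetical and are not Flexirails' actual API.

```ruby
# Hypothetical names throughout; this is not Flexirails' actual API.
class TableView
  def initialize(source, columns)
    @source  = source   # anything responding to #fetch_rows(offset, limit)
    @columns = columns  # attribute names to display, in order
  end

  # Returns one page as an array of rows, each row an array of cell values.
  def page(number, per_page: 25)
    offset = (number - 1) * per_page
    @source.fetch_rows(offset, per_page).map do |record|
      @columns.map { |column| record.public_send(column) }
    end
  end
end

# An in-memory source. A database-backed source would implement the
# same single method on top of its own query API.
class ArraySource
  def initialize(records)
    @records = records
  end

  def fetch_rows(offset, limit)
    @records[offset, limit] || []
  end
end

Band = Struct.new(:name, :stage)
bands = [Band.new('A', 'Main'), Band.new('B', 'Tent')]
view = TableView.new(ArraySource.new(bands), [:name, :stage])
p view.page(1, per_page: 1)
```

Searching and filtering stay in the application: it decides what the source returns, and the view only renders it.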

2. Don't add too many UI customization options

In the original Flexirails version users could reorder columns, hide them, sort by attributes, select multiple rows for batch editing and more. Even worse, users could build specific views, save them under a specific name, manage these views, etc.

Again, this was a bad decision. Here's why:

  • while customizable views might be great for some power users, they get really irritating for the user as soon as she switches her PC or user account and can no longer access her own views. Dictating a single common user interface performs much better in those scenarios.

  • when adding new columns you need to have a clear upgrade path for old views, so they display the new information as well. This can be very irritating for your users because you might show columns they never knew existed, or they might not even know that a new column was made available. You might also end up destroying old views a user created, leading to the worst-case scenario of any computer system: not respecting the user's data (and the time they spent creating it).

  • adding too much functionality to a single view leads to a "one-size-fits-all" problem, where more and more features are packed into a single view. Having multiple views, each with a single, clear responsibility, is much better and easier for a user to learn and understand - see the Unix philosophy.

3. Don't monkey patch ActionController::Base

At first I did not think about how to properly apply object-oriented design principles to Flexirails, leading to a custom DSL, which was monkey patched into ActionController::Base so I could configure Flexirails inside of my controllers.

A custom DSL is a big problem since new programmers need to learn it before they can use it. Adding simple features can be a problem as well, e.g. supporting different permissions or showing different content to different users.

This became somewhat of a problem since adding all this at class-level scope in your controller tends to get messy pretty fast. Also, DSLs tend to be unintuitive to extend, so don't create your own DSL (unless you have a really good reason to do it) - and don't ever monkey patch ActionController::Base.
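As a sketch of the alternative, a table can be described by a plain Ruby object that a controller instantiates, with no changes to ActionController::Base. The class and column names below are invented for illustration and do not come from Flexirails.

```ruby
# Hypothetical example; class, method, and column names are illustrative.
class OrdersTable
  def initialize(current_user)
    @current_user = current_user
  end

  # Ordinary Ruby replaces the class-level DSL: showing different
  # content to different users is just a conditional.
  def columns
    base = [:number, :customer, :status]
    @current_user.admin? ? base + [:margin] : base
  end
end

# Stand-in for the application's user model.
User = Struct.new(:name, :admin) do
  def admin?
    admin
  end
end

p OrdersTable.new(User.new('alice', true)).columns
p OrdersTable.new(User.new('bob', false)).columns
```

In a controller action this would be a single line such as @table = OrdersTable.new(current_user), and the object can be unit tested without loading Rails.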

Conclusion

Whatever you did three years ago, take your time to revisit your old work. You'll be astonished by how much you'll learn just by looking at and improving on the code you wrote all those years ago ;)

Also, Flexirails 2 turned out really great - it's completely decoupled from my persistence layer, so I am able to use it with whatever data source I like (currently the filesystem, DataMapper, and ActiveRecord 2.x and 3.x). I've also removed both searching and most UI customizations, leading to much cleaner and more maintainable code. What I like most:

  • I reduced the total number of code lines in Ruby from ~ 2000 to ~ 128, and code lines in JavaScript from ~ 1400 to ~ 400.
  • It's extremely easy & intuitive to extend and change - also due to the small code size

I might open source it sometime in the future. But that's a separate blog post on its own.