A simple scalable web server HA architecture suitable for medium sized projects

Having deployed and maintained several public medium sized web sites running CubicWeb when I worked at SecondWeb, I was asked by my friends from Logilab to write a blog post describing how we managed our deployment while working with the customer and the hosting company.

Non technical (albeit important) considerations

Customers who want to run such a medium-traffic web site either tell you which hosting company they partner with, or ask you to find one, so you have no choice but to deal with an external hosting structure to manage the servers. I prefer this, by the way, because:

High Availability (HA) hosting really requires skills and hardware that are neither common nor cheap;
HA hosting requires 24/7/365 availability that SecondWeb could not (and did not even want to) offer.

It is clearly difficult for all parties (try to put yourself in the shoes of the customer...) to manage a web site with 3 partners involved, each with their own goals. From the development leader's point of view, you will notice that the technical staff of the hosting company changes continuously, and you keep seeing the same operational errors even if you provide and keep improving high-quality documentation. The software upgrade documentation has to be particularly clear, as it greatly influences the overall web site availability. You also have to keep a history of the interventions on the servers yourself and maintain an up-to-date copy of the configuration files.

The overall architecture proposed here partly benefits from this experience with managed hosting companies, in that we tried to keep it simple.

What traffic size? Why not bigger?

The architecture proposed here has been successfully tested with sites delivering web pages to up to 2 million unique visitors per month. It should scale further, depending on your site's database access needs: if you need very fresh data and have a lot of write operations to the database, you will need to distribute database access among several servers, which is beyond the scope of this post.

This is the main limitation of the proposed architecture and the reason why it is not well suited to higher traffic.

Design choices

Load balancing - Preserve user sessions

To achieve very high availability for your web site, you must have no single point of failure in the whole architecture, which can be far from reasonable from a cost point of view. However, hosting companies can share costs between their customers and have them benefit from a double network infrastructure all along the way from the Internet to your web servers, themselves hosted in two distant locations. You may then choose an even number of web servers, half of them hosted on each network infrastructure.

The important thing is that you must preserve user sessions. As of CubicWeb 3.10, DB persistent sessions have not been implemented yet (they will be soon, there is a ticket planned for this functionality), so you must preserve session cookies by always directing a given user to the same web server. This is usually achieved by configuring the load balancer(s) in IP hash mode, which is faster than balancing on the session cookie because the latter requires reaching the HTTP stack rather than staying at the TCP/IP level.
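The principle of IP hash balancing can be illustrated with a few lines of Python. This is only a sketch of the idea, not the algorithm of any particular load balancer: the hash function, the `pick_server` helper and the back-end names are assumptions made for the example.

```python
import hashlib


def pick_server(client_ip, servers):
    """Map a client IP to one server; a given IP always gets the same one."""
    digest = hashlib.md5(client_ip.encode("ascii")).digest()
    # Use the first 4 bytes of the digest as an integer, then wrap it
    # onto the list of available servers.
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


# Hypothetical back-ends: two web servers, two listening ports each.
SERVERS = ["web1:81", "web1:82", "web2:81", "web2:82"]

for ip in ("192.0.2.10", "192.0.2.11", "192.0.2.10"):
    print(ip, "->", pick_server(ip, SERVERS))
```

The first and third lines printed are identical, which is what keeps a session cookie valid. Note that with such a naive modulo scheme the mapping changes for many users as soon as a server is added to or removed from the list.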

Squid caching, processor load balancing

Now if you have multi-processor web servers (which is very likely these days), you will need to run one CubicWeb application instance per processor, or the Python GIL will limit the CPU of your application to a fraction of the available power. This is pretty easy: you just have to duplicate the configuration directories from /etc/cubicweb.d, changing instance names and ports. You can use a simple sed-based script to generate these copies automatically and keep them in sync.

Now that we have one instance per processor, the problem of preserving sessions is back. It can be elegantly solved using Squid, which can of course deliver cached objects (in particular images, more on this later), but can also listen on several ports and distribute incoming requests among the CubicWeb instances based on the port they arrive on. Note that the load balancer must be set up to balance between ports of the web servers, one port for each processor. The Squid configuration file to achieve this looks like:
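The duplication step can also be scripted in Python instead of sed. The sketch below is as crude as a sed one-liner: it copies a whole instance directory and blindly replaces the instance name and the port number in every text file it finds. The `clone_instance` helper and the instance names in the usage comment are hypothetical, and you should check which files and options your own instances actually contain.

```python
import re
import shutil
from pathlib import Path


def clone_instance(base_dir, source, target, source_port, target_port):
    """Copy base_dir/source to base_dir/target, renaming instance and port."""
    src = Path(base_dir) / source
    dst = Path(base_dir) / target
    if dst.exists():
        shutil.rmtree(dst)  # regenerate from scratch to stay in sync
    shutil.copytree(src, dst)
    for path in list(dst.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # leave binary files untouched
        text = text.replace(source, target)
        # Replace the port only when it appears as a whole number.
        text = re.sub(r"\b%d\b" % source_port, str(target_port), text)
        path.write_text(text, encoding="utf-8")
    return dst


# Hypothetical usage, one call per additional processor:
# clone_instance("/etc/cubicweb.d", "myapp1", "myapp2", 8081, 8082)
```

Running it again after each change to the reference instance keeps the copies in sync, at the cost of overwriting any manual edit made to them.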

http_port 81 defaultsite=www.example.org vhost
acl portA myport 81

http_port 82 defaultsite=www.example.org vhost
acl portB myport 82

acl site1 dstdomain www.example.org

cache_peer 127.0.0.1 parent 8081 0 no-query originserver default name=server_1
cache_peer_access server_1 allow portA site1
cache_peer_access server_1 deny all

cache_peer 127.0.0.1 parent 8082 0 no-query originserver default name=server_2
cache_peer_access server_2 allow portB site1
cache_peer_access server_2 deny all

This is a way to set up Squid to listen on ports 81 and 82 and distribute requests for www.example.org to ports 8081 and 8082 respectively. This way, requests should be evenly balanced between the processors on a bi-processor web server.

You can now set up Squid more classically to achieve what it was originally designed for: caching. See the Squid docs for this, particularly the refresh_pattern directive. Note that you do not need to force any HTTP cache standard feature in Squid, as CubicWeb enables you to fine-tune caching using simple HTTPCacheManager classes found in cubicweb/web/httpcache.py (at the end of this file, you will also find the default cache manager configuration for the entity and startup views).
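With more than two processors, writing these stanzas by hand becomes error-prone. As a sketch, the following Python function generates them using only the directives shown above. The `squid_stanzas` name and the ACL naming scheme (port81, port82... instead of portA, portB) are choices made for this example.

```python
def squid_stanzas(site, first_listen_port, first_backend_port, count):
    """Return Squid config text mapping `count` listening ports to
    `count` local back-end ports, one pair per processor."""
    lines = ["acl site1 dstdomain %s" % site, ""]
    for i in range(count):
        listen = first_listen_port + i
        backend = first_backend_port + i
        acl = "port%d" % listen
        peer = "server_%d" % (i + 1)
        lines += [
            "http_port %d defaultsite=%s vhost" % (listen, site),
            "acl %s myport %d" % (acl, listen),
            "cache_peer 127.0.0.1 parent %d 0 no-query originserver default name=%s"
            % (backend, peer),
            "cache_peer_access %s allow %s site1" % (peer, acl),
            "cache_peer_access %s deny all" % peer,
            "",
        ]
    return "\n".join(lines)


# Same setup as the hand-written example: ports 81 and 82 to 8081 and 8082.
print(squid_stanzas("www.example.org", 81, 8081, 2))
```

Changing the last argument to the number of processors is then enough, provided the load balancer and the CubicWeb instances use the same port ranges.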

CubicWeb with Apache frontend

This is controversial, but it has never hurt me: I like to put an Apache frontend between Squid and the Twisted-based CubicWeb application, because hosting companies are usually pretty good at setting it up, and like to use server status for monitoring, mod_deflate for textual content compression, and mod_rewrite and other modules to customize, monitor or fine-tune the web servers.

It can however be argued that Apache is a huge piece of software for such a limited use, and that its memory footprint would be better used for caching.

No shared disk

This is an interesting part that simplifies the overall setup: if you want to save data on disk, it is likely that you also want to keep it in sync between the web servers, or use a highly secure network storage solution.

As we already have a data store accessible from the web servers, namely the database itself, I often choose to use it even for images. This looks like every sysadmin's nightmare, but if you make sure the images are not fetched every second from the database, by using fine-tuned cache settings, it will not hurt. And this way you still benefit from the flexibility of a database and the easier maintenance of a single data store. We can use CubicWeb cache settings to allow Squid to cache images for 1 hour, for example. If you have a very dynamic web site, however, you will then need to force a URL change when an image is edited. This can easily be achieved in CubicWeb using a custom edit controller that creates a new image when the data attribute of an Image instance is edited, as illustrated here:

from cubicweb import typed_eid
from cubicweb.selectors import yes
from cubicweb.web.views.editcontroller import EditController


class CustomEditController(EditController):
    __select__ = EditController.__select__ & yes()

    def handle_updated_image(self, old_eid):
        """modify submitted form to change old_eid into a new entity eid in all keys / values"""
        old_eid = unicode(old_eid)
        form = self._cw.form
        new_eid = self._cw.varmaker.next()
        # handle image eid
        del form['__type:%s' % old_eid]
        form['__type:%s' % new_eid] = u'Image'
        # handle eid list
        index = form['eid'].index(old_eid)
        form['eid'] = form['eid'][:index] + [new_eid] + form['eid'][index+1:]
        # handle attribute and relations
        # (items() returns a list copy in Python 2, so the form can be
        # safely modified inside the loop)
        for (k, v) in form.items():
            if v == old_eid:
                form[k] = new_eid
            if k.endswith(u':%s' % old_eid):
                form[k[:-len(old_eid)] + new_eid] = v
                del form[k]

    def _default_publish(self):
        # implement image creation when data image was updated, so that we can use
        # a far expiry date cache on download view
        images = []
        # iterate over a copy: handle_updated_image() modifies the form
        for (k, v) in self._cw.form.items():
            if v != 'Image' or not k.startswith('__type') or k == self._cw.form['__maineid']:
                continue
            try:
                eid = typed_eid(k[7:])
            except ValueError:
                continue
            if self._cw.form.get('data-subject:%s' % eid, None):
                self.handle_updated_image(eid)
                images.append(eid)
        super(CustomEditController, self)._default_publish()
        # delete the old images now that the new ones have been created
        for eid in images:
            self._cw.execute('DELETE Image I WHERE I eid %(eid)s', {'eid': eid})

To add the 1-hour expiry date to the image download view, you can use:

from cubicweb.selectors import yes
from cubicweb.web import httpcache
from cubicweb.web.views.idownloadable import DownloadView


class CustomDownloadView(DownloadView):
    __select__ = DownloadView.__select__ & yes()
    http_cache_manager = httpcache.MaxAgeHTTPCacheManager
    cache_max_age = 3600  # seconds, i.e. 1 hour

Database server

Hosting companies now often have a pretty good knowledge of PostgreSQL, the favorite DB back end for CubicWeb. They usually propose to replicate the database for data safety at a low cost, using the PostgreSQL log shipping feature. Note that the new PostgreSQL 9 versions should make it easier to set up replication modes that could be useful to improve performance and scalability, but there is still a lack of production-level experience for the moment. Please share yours if you have some, because this is the main issue to deal with in order to scale up further.

Pre-production

It is worth mentioning that you need a pre-production server hosted by the same company on the same hardware (or virtual machine), because:

software upgrades will run more smoothly if the technical staff of the hosting company has already performed the same upgrade operation once: check that the same person does both within a short timeframe if possible;
you will feel better if your migration scripts have successfully run on a fresh copy of the production data: ask for a db copy before a pre-production upgrade; this is much easier to do if you do not have to copy the database dumps remotely;
the pre-production server can host its own database server and the replica of the production one.

Monitoring

When you experience web site downtime, it is much too late to take a first look at the available monitoring. It is important to prepare the tools you need to diagnose a problem, get used to reading the graphs, and keep in mind the orders of magnitude of the values and their variations.

Even the simplest graphs, like CPU usage, need to be correctly interpreted. In a recent setup, I did not realize that only one CPU was used on a bi-processor server, delivering half the power it should... When you cannot access the machine and use top, the monitoring graphs are the only information you get, so you must know how to read them!

Apart from the classical CPU, CPU load, (detailed) memory usage, and network traffic, ask for PostgreSQL, Squid, and Apache specific graphs (plug-ins for them are easy to find and install for classic monitoring solutions).

For CubicWeb web sites, it is also worth setting up the following views and using them for automatic alerts:

a software / db version consistency monitoring
a db pool size monitoring
a simple db connection check view
a view writing the server host name: it is of no use for automatic alerts, but it shows which server your IP is directed to, which is needed when you cannot reproduce the behaviour the customer is complaining about...

Here are some classes I use for these tasks. Feel free to reuse and adapt them to your needs:

from socket import gethostname

from cubicweb.selectors import yes
from cubicweb.view import View


class _MonitoringView(View):
    __abstract__ = True
    __select__ = yes()
    content_type = 'text/plain'
    templatable = False


class PoolMonitoringView(_MonitoringView):
    __regid__ = 'monitor_pool'

    def call(self):
        # percentage of the connections pool currently in use
        repo = self._cw.cnx._repo
        max_pool = self._cw.vreg.config['connections-pool-size']
        percent = ((max_pool - repo._available_pools.qsize()) * 100.0) / max_pool
        self.w(u'%s%%' % percent)


class DBMonitoringView(_MonitoringView):
    __regid__ = 'monitor_db'

    def call(self):
        try:
            count = self._cw.execute('Any COUNT(X) WHERE X is CWUser')[0][0]
            self.w(u'ServiceOK : %s users in DB' % count)
        except Exception:
            self.w(u'ServiceKO')


class VersionMonitoringView(_MonitoringView):
    __regid__ = 'monitor_version'

    def versions_text(self, versions):
        return u' | '.join(cube + u': ' + u'.'.join(unicode(x) for x in version)
                           for (cube, version) in versions)

    def call(self):
        # compare versions recorded in the database with those installed on disk
        config = self._cw.vreg.config
        vc_config = config.vc_config()
        db_config = [('cubicweb', vc_config.get('cubicweb', '?'))]
        fs_config = [('cubicweb', config.cubicweb_version())]
        for cube in sorted(config.cubes()):
            db_config.append((cube, vc_config.get(cube, '?')))
            try:
                fs_version = config.cube_version(cube)
            except Exception:
                fs_version = '?'
            fs_config.append((cube, fs_version))
        db_config = self.versions_text(db_config)
        fs_config = self.versions_text(fs_config)
        if db_config == fs_config:
            self.w(u'ServiceOK : FS config %s == DB config %s' % (fs_config, db_config))
        else:
            self.w(u'ServiceKO : FS config %s != DB config %s' % (fs_config, db_config))


class HostnameMonitoringView(_MonitoringView):
    __regid__ = 'monitor_hostname'

    def call(self):
        self.w(unicode(gethostname()))
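On the alerting side, a monitoring tool only has to fetch these views and look for the ServiceOK prefix. Here is a minimal sketch of such a check, written as a standalone Python 3 script. The `check_instance` helper is hypothetical, and the URL pattern `<base url>/view?vid=<view id>` is an assumption about how the views are exposed on your instance, so adapt it to your setup.

```python
from urllib.request import urlopen

# Views whose output starts with "ServiceOK" when everything is fine.
MONITOR_VIEWS = ("monitor_db", "monitor_version")


def fetch(url, timeout=5):
    """Return the body of `url` as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", "replace")


def check_instance(base_url, fetcher=fetch):
    """Return a list of (view id, detail) pairs that should raise an alert."""
    failures = []
    for vid in MONITOR_VIEWS:
        # Assumed URL pattern, adapt it to your instance.
        url = "%s/view?vid=%s" % (base_url.rstrip("/"), vid)
        try:
            body = fetcher(url)
        except Exception as exc:
            failures.append((vid, "unreachable: %s" % exc))
            continue
        if not body.startswith("ServiceOK"):
            failures.append((vid, body.strip()))
    return failures
```

The `fetcher` argument only exists so that the check can be tried without a running instance. An empty list means that all the views answered ServiceOK. Such a script should be run against each instance port directly rather than through the load balancer, so that each processor is checked.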

Sketch of the architecture and conclusion

There is a sketch of the proposed architecture. Please comment on it and share your experience on the topic, I would be happy to learn your tips and tricks.

I would conclude with an important remark regarding performance: a good scalable architecture is of great help to run a busy web site smoothly, but the boost you get by optimizing your software is usually worth the effort and must be seriously considered before any hardware upgrade, even if it seems costly at first glance.