How to Build a REST Implementation that Scales Better Than SMPP


There is a myth in the wholesale SMS industry that you can only achieve scale with SMPP. Here are a few tips from Paul Cook, our senior architect, on how to achieve scale with the Vonage REST API as an alternative to SMPP. By applying these principles, it's possible to build a REST integration that rivals a well-put-together SMPP integration, and will blow a badly built one out of the water. It really comes down to a few basics.


Use HTTP Keep-Alive

Use HTTP/1.1 rather than HTTP/1.0. That is, make use of connection keep-alive, avoiding the overhead of socket-establishment round trips on each request.
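
As a minimal sketch of connection reuse using only Python's standard library — the local server here stands in for the remote API, and the `/sms/json` path is purely illustrative:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"    # enables keep-alive on the server side
    def do_GET(self):
        body = b'{"status": "ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):    # silence per-request logging
        pass

# Stand-in for the remote service, bound to a random free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One HTTP/1.1 connection reused for every request: the TCP (and TLS)
# handshake cost is paid once, not once per message.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
for _ in range(3):
    conn.request("GET", "/sms/json")
    resp = conn.getresponse()
    data = resp.read()    # drain the body before reusing the connection
conn.close()
server.shutdown()
```

Most HTTP client libraries offer an equivalent (for example a reusable session or connection-pool object); the key is to hold onto the connection rather than opening a fresh socket per request.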


Use Concurrent Requests

Make use of an executor/worker-thread style design pattern to ensure that multiple requests are happening at once. One of the biggest scaling pitfalls when making REST requests is running your dispatching in a single thread and keeping everything serial. Individual HTTP requests have a degree of latency in them: the underlying TCP round trips needed to get the data to the service and retrieve the response, and, at the protocol level, HTTP being by its nature a serial request-and-response protocol. Using multiple concurrent requests will help absorb this latency.
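
A sketch of the worker-pool pattern using Python's `concurrent.futures` (the `send_message` function is a placeholder; its sleep stands in for network round-trip latency):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def send_message(msg_id):
    # Placeholder for an HTTP POST to the messaging API; the 50 ms
    # sleep stands in for the request's round-trip latency.
    time.sleep(0.05)
    return msg_id, "submitted"

messages = list(range(40))

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(send_message, m) for m in messages]
    results = [f.result() for f in as_completed(futures)]
elapsed = time.monotonic() - start

# Serially, 40 requests x 50 ms would take about 2 s; eight workers
# overlap the latency and finish in roughly 40/8 * 50 ms = ~0.25 s.
```

The same shape applies whatever the language: a bounded pool of workers keeps several requests in flight so that one request's latency does not stall the rest.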

Think of TCP's sliding window, with its notion of many packets in flight waiting to be acknowledged. If the TCP stacks in our machines did not do this, and instead sent each packet one at a time while waiting for a response, the whole internet would immediately grind to a halt.


Throttle Your Submission Rate

It can seem counter-intuitive that, in order to go fast, it is necessary to throttle, but consider this: if you blast away transmitting data as fast as you possibly can, you will quickly either exceed the capacity of the other end to respond and keep up, or exhaust the capacity of the pipe in between to move all of this data. When this happens, it will have one of several effects.

The TCP stack will lose packets, which will need to be retransmitted. This causes a delay while you wait for timeouts to trigger the retransmission, and it generates even more traffic on the pipe, which simply compounds the issue.

Or your app may be receiving throttle NACKs, or no response at all, which, depending on your application logic, may trigger a back-off, wait, and retry mechanism (for example, if your requests are generated from messages in a queueing framework). This again causes delays while waiting for timeouts, and generates additional traffic.
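
A back-off-and-retry loop of that kind might be sketched like this (the `Throttled` exception and `send_with_backoff` name are illustrative, not part of any real API):

```python
import random
import time

class Throttled(Exception):
    """Raised when the remote service answers with a throttle NACK."""

def send_with_backoff(send, max_attempts=5, base_delay=0.1):
    """Retry `send` with exponential backoff plus jitter, instead of
    immediately re-submitting into an already-saturated service."""
    for attempt in range(max_attempts):
        try:
            return send()
        except Throttled:
            # Double the delay on each failed attempt; the jitter spreads
            # retries from many workers so they do not arrive in lockstep.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise RuntimeError("gave up after repeated throttle responses")
```

Note that every retry is extra traffic and extra delay — which is exactly why pacing your submissions in the first place, as described next, usually wins.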

The cumulative effect of these delays can in fact drive the actual throughput far lower than if you implemented a throttle and submitted more slowly, but at a steady rate that is within the capacity of the receiving system and of the pipe in between.
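
One common way to implement such a throttle is a token bucket; this is a minimal, thread-safe sketch (class and method names are my own, and the 30 messages-per-second rate is an arbitrary example):

```python
import threading
import time

class TokenBucket:
    """Token-bucket throttle: allows `rate` requests per second, smoothing
    submission to a steady pace the receiving system can sustain."""
    def __init__(self, rate, burst=None):
        self.rate = float(rate)
        self.capacity = float(burst if burst is not None else rate)
        self.tokens = self.capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until one token is available, then consume it."""
        with self.lock:
            now = time.monotonic()
            # Refill tokens earned since the last call, up to capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                # Sleep just long enough for one token to accrue.
                time.sleep((1 - self.tokens) / self.rate)
                self.last = time.monotonic()
                self.tokens = 0
            else:
                self.tokens -= 1

bucket = TokenBucket(rate=30)    # e.g. pace submissions at 30 msg/s

def submit(message):
    bucket.acquire()             # block until we are within our rate
    # ... perform the actual HTTP request here ...
```

Each worker thread simply calls `bucket.acquire()` before submitting, so however many workers you run, the aggregate rate stays steady.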

Low-Latency Responses

In the scenario where you are implementing a service that receives callbacks for incoming messages, and you expect to handle a large volume of requests in a short space of time, it is vital that the receipt and acknowledgement of these requests is as quick as possible.

It is likely that, upon receiving a message callback, your system will have a number of things to do. It may wish to log the request in a database; it may wish to update some totals against an account; it may need to execute some complex business logic to perform actions on one of your users' accounts as a result of the contents of the message.

Any of these activities may in itself be a lengthy process, and may rely on external resources such as database servers, locks on database tables, or further web service calls to other services.

This can all add up to an action that takes a very long time to execute, maybe many hundreds of milliseconds, maybe even many seconds.

A common pitfall is to follow the anti-pattern where the callback request is received and immediately executes a sequence of steps similar to those described above, potentially waiting on a shared lock, making remote web service calls, or waiting for slow round-trip access to a database. Only when these actions are complete does the system acknowledge receipt of the callback message.

This sort of pattern does not scale beyond a very small volume of traffic.

The service performing the callback requests to your application (in this case, the Vonage Messaging platform) will be operating its own throttling and flow control mechanisms. If you take a long time to acknowledge messages, Vonage will send you more requests in parallel. But this parallelism is limited. Only a certain number of requests will be made until the acknowledgments of at least some of those requests have been received. Again, think TCP sliding windows.

Thus, taking a particularly bad example where requests take five seconds to respond: if you have a large number of incoming messages, you will quickly receive five requests, then receive nothing until those five are acknowledged (effectively causing a five-second stall in the traffic flow), and each time this window is fully used up, the traffic flow will stall again.

In order to receive high volumes of messages rapidly, it is thus vital that these requests are acknowledged quickly. Common patterns and frameworks are widely available for decoupling the receipt and acknowledgement of a request from its execution. At a basic level, Java developers have the executor framework available to dispatch the request to a separate execution thread; similar frameworks exist for all common languages.

At a more enterprise level, consider making use of a queueing framework that will allow you to accept a large volume of requests quickly, and deal with them as fast as your application is able to.
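
The decoupling described above can be sketched with an in-process queue and a background worker (the `handle_callback` and `process` names are hypothetical, and the returned 200 stands in for whatever response object your web framework uses):

```python
import queue
import threading

work_queue = queue.Queue()

def handle_callback(payload):
    """Called once per incoming webhook request: enqueue the payload and
    acknowledge immediately, deferring the expensive work."""
    work_queue.put(payload)
    return 200    # fast acknowledgement back to the platform

processed = []

def process(payload):
    # Stand-in for the slow part: DB writes, business logic, further
    # web service calls...
    processed.append(payload)

def worker():
    while True:
        payload = work_queue.get()
        if payload is None:    # shutdown sentinel
            break
        process(payload)
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```

The acknowledgement now takes microseconds regardless of how slow `process` is; swapping the in-process queue for a durable broker gives you the enterprise-grade version, with persistence across restarts and workers spread over many machines.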


Scaling Out

As the volume of requests grows, it is inevitable that, in time, the hardware requirements of your infrastructure will increase in order to serve all of the requests in a timely manner.

Typically, this will involve installing a number of instances of your application sitting behind a load balancer. Incoming end-user web requests are spread across your farm of servers using a variety of algorithms (maybe round-robin, maybe random, weighted differently for different servers, with sticky or floating sessions). These will all depend on the individual needs of your application.

As well as the end-user web requests, the same is also true for incoming web service requests. Your farm of servers sits behind the load balancer's virtual address and each receives a subset of the requests. As the volume increases, you add more servers to handle more of the load.

This is, of course, a simplified picture. There will be other factors your application needs to consider in order to scale, such as database capacity, lock contention, or shared cache infrastructure; these are well outside the scope of this article. The overriding principle of load balancing the requests remains true, though.

This brings us to one of the key advantages of a REST-based approach over the more traditional approach of a persistently connected socket speaking a protocol such as SMPP. SMPP is a highly scalable telecoms protocol, but it brings a number of challenges, one of the biggest of which is how to scale across multiple servers.

Persistent-socket protocols are, by their nature, heavyweight beasts, and an allocation of available binds is usually a restricted resource. It is rare to encounter a provider who is happy for you to open a large number of sockets from a large number of servers and keep them open permanently. This means that, as your number of servers grows, it may not be possible for them all to take part in receiving message requests. This can be a serious roadblock to scaling your infrastructure quickly to meet changing demands.

Additionally, the principle of such an integration is that you bind your socket to the destination service and await requests. These are direct point-to-point connections that your application must establish. There is no automatic means of switching that connection between, say, a master and a backup server in the event of failure, or of shaping the traffic flow so that more requests are sent to your servers with spare capacity than to the servers that are busy. This is completely out of your control!

With a REST implementation, there is no persistent socket to establish, and no overhead in maintaining that socket, performing keep-alive cycles, or managing the lifecycle of establishing and closing down the connection. Instead, you receive a stream of requests that you can shape and direct according to your own requirements.

If you have a server that is twice as powerful as some of your older kit, you can set up a load-balancing policy to send more requests to that server. If you have a server you want to keep as a warm standby, you can do so, and only start sending it requests when circumstances dictate. The important distinction here is that you are in control, and are free to grow and scale your infrastructure in whichever way is required.
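
With an nginx-style load balancer, for example, both of those policies are a few lines of configuration (the hostnames, port, and webhook path below are hypothetical):

```nginx
upstream callback_pool {
    # weight=2 sends twice the share of requests to the more
    # powerful machine.
    server app1.internal:8080 weight=2;
    server app2.internal:8080 weight=1;
    # Warm standby: only receives traffic if the others are down.
    server standby.internal:8080 backup;
}

server {
    listen 443 ssl;
    location /webhooks/inbound-sms {
        proxy_pass http://callback_pool;
    }
}
```

Adding capacity is then a matter of adding a `server` line and reloading the balancer — no coordination with the upstream provider required.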
