Scalable remote management of embedded Linux devices via WebSockets

Airtame Cloud is a web-based system which enables organizations to remotely monitor and administrate large deployments of Airtame wireless streaming devices.

In this post, we explore how Airtame devices communicate with Airtame Cloud via WebSockets and how we scaled our backend systems to handle increasing numbers of users and devices.

Airtame Cloud provides an overview of an organization’s devices

A Case for WebSockets

All communication between Airtame devices and Airtame Cloud occurs over TLS-enabled WebSocket connections, which the devices initiate. WebSockets have several qualities which make them well-suited for our needs:

  • Real-time. When a device’s state or settings are altered, we want the changes to be propagated through the backend as quickly as possible. WebSockets utilize a persistent TCP connection, so updates can be pushed immediately with minimal overhead.
  • Stateful. Devices only need to identify themselves during the initial WebSocket handshake, after which they maintain a stateful connection with our backend — no session management is required.
  • Bi-directional. The server can push messages to devices at any time, without the devices having to poll for updates.
  • Network friendly. Airtame devices are deployed in networks with a wide range of configurations. Nearly all networks we’ve encountered allow outgoing TCP traffic to port 443, and the WebSocket protocol’s HTTP-compatible handshake is well-suited for proxy traversal. Additionally, since the device makes the initial connection to the Airtame Cloud backend, there’s no need to worry about NAT or firewall restrictions on incoming connections.

Connection Establishment Flow

In order to establish a WebSocket connection, a device must present its authentication token for verification as follows:

  1. The device initiates a secure WebSocket (WSS) handshake request, providing its token as a request header.
  2. The handshake request is picked up by nginx, which performs TLS termination and routes the request to a backend server.
  3. The server verifies the device’s token against a hash in the database.
  4. Assuming a valid token, the server issues a handshake upgrade response via nginx, which re-applies TLS and forwards the response to the device.
  5. Now fully-established, the WebSocket connection serves as a bi-directional communication channel between the device and server.
WebSocket connection establishment
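
Step 3 of the flow above can be sketched in Go using only the standard library. This is a rough illustration under stated assumptions, not the actual Airtame implementation: the header name `X-Device-Token`, the choice of SHA-256, the in-memory `tokenStore` standing in for the database, and the `upgradeStub` handler standing in for the real WebSocket upgrade are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// deviceTokenHeader is a hypothetical header name; the post does not name it.
const deviceTokenHeader = "X-Device-Token"

// hashToken returns the hex-encoded SHA-256 digest of a token.
// SHA-256 is an assumption; the post only says a hash is stored.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// tokenStore maps token hashes to device ids. It stands in for the database.
type tokenStore map[string]int

// authenticate hashes the presented token and looks the hash up in the store.
func (s tokenStore) authenticate(token string) (deviceID int, ok bool) {
	if token == "" {
		return 0, false
	}
	deviceID, ok = s[hashToken(token)]
	return deviceID, ok
}

// handshakeGuard rejects handshake requests that carry a missing or unknown
// token, and otherwise passes the request on to next.
func handshakeGuard(store tokenStore, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if _, ok := store.authenticate(r.Header.Get(deviceTokenHeader)); !ok {
			http.Error(w, "invalid device token", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// upgradeStub stands in for the handler that would perform the WebSocket upgrade.
var upgradeStub = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// tryHandshake sends a fake handshake request and returns the HTTP status code.
func tryHandshake(h http.Handler, token string) int {
	req := httptest.NewRequest(http.MethodGet, "/ws", nil)
	if token != "" {
		req.Header.Set(deviceTokenHeader, token)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	store := tokenStore{hashToken("device-42-secret"): 42}
	h := handshakeGuard(store, upgradeStub)
	fmt.Println("valid token:", tryHandshake(h, "device-42-secret"))
	fmt.Println("wrong token:", tryHandshake(h, "wrong-token"))
}
```

Storing only a hash means a leaked database row does not reveal a usable token. In a real server, the stub would be replaced by a WebSocket library's upgrade call.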

Side note — we’re enthusiastic Go users, hence the adorable Gopher cloud representing our backend in the diagram.

JSON-RPC

After establishing the WebSocket connection, the device and backend communicate by exchanging JSON-RPC messages, of which there are two types:

  • Requests, where commands initiated from Airtame Cloud are remotely executed on devices, for example, triggering firmware updates. Once a device has executed a command, it responds with the result of the command execution, including a unique id extracted from the request which correlates the request and response.
  • Notifications, where devices inform the backend of changes in their state or settings, for example, communicating a modification to the dashboard URL displayed on the Airtame device. A notification does not contain an id or warrant a response.
Users can interact with devices, for example by triggering firmware updates
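
The two message types described above differ only in whether an id is present, which could be modelled in Go roughly as follows. This is a sketch under assumptions: the `message` type, its helper functions, and the `settingsChanged` method name are hypothetical and do not come from the Airtame codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// message covers both JSON-RPC message types. ID is a pointer so that an
// absent id (a notification) can be told apart from an id of 0.
type message struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      *int            `json:"id,omitempty"`
	Method  string          `json:"method,omitempty"`
	Params  json.RawMessage `json:"params,omitempty"`
}

// isNotification reports whether the message carries no id and therefore
// expects no response.
func (m message) isNotification() bool { return m.ID == nil }

// newRequest encodes a request with the given id and method.
func newRequest(id int, method string) ([]byte, error) {
	return json.Marshal(message{JSONRPC: "2.0", ID: &id, Method: method})
}

// parseMessage decodes a raw message and checks the protocol version.
func parseMessage(raw []byte) (message, error) {
	var m message
	if err := json.Unmarshal(raw, &m); err != nil {
		return message{}, err
	}
	if m.JSONRPC != "2.0" {
		return message{}, fmt.Errorf("unsupported jsonrpc version %q", m.JSONRPC)
	}
	return m, nil
}

func main() {
	raw, err := newRequest(1234, "update")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))

	// "settingsChanged" is a made-up method name used for illustration.
	note, err := parseMessage([]byte(`{"method": "settingsChanged", "jsonrpc": "2.0"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println("notification:", note.isNotification())
}
```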

Remote Device Management

Consider an example where a user initiates a remote firmware update on a device with id 42:

  1. The user’s browser sends a POST request to the API endpoint: /devices/42/update
  2. The request is picked up by nginx, which performs TLS termination and routes the request to a backend server.
  3. The server verifies that the authenticated user has permission to update the specified device.
  4. We create a JSON-RPC request containing a random id, for example:
    {"id": 1234, "method": "update", "jsonrpc": "2.0"}
    Since we operate multiple backend servers, and nginx has no knowledge of which Airtame devices are connected to which servers, we need to route the JSON-RPC message to the server with an active WebSocket connection to device 42. For this, we use Redis Pub/Sub: we first subscribe to deviceAction:1234, the channel where the response to the JSON-RPC request will eventually be sent, then publish the JSON-RPC request to the channel device:42.
  5. Redis delivers the JSON-RPC request to the backend server subscribed to the device:42 channel.
  6. The responsible backend server transmits the JSON-RPC message via its WebSocket connection with device 42.
Routing a JSON-RPC request with Redis Pub/Sub
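
The ordering in step 4 (subscribe to the response channel first, then publish the request) can be sketched as below. To keep the example self-contained, the `broker` type is a small in-memory stand-in for Redis Pub/Sub, not a Redis client. All function names are hypothetical; only the channel names `device:42` and `deviceAction:1234` come from the steps above.

```go
package main

import (
	"fmt"
	"sync"
)

// broker is an in-memory stand-in for Redis Pub/Sub, used only so that this
// sketch runs without a Redis server.
type broker struct {
	mu   sync.Mutex
	subs map[string][]chan string
}

func newBroker() *broker { return &broker{subs: make(map[string][]chan string)} }

// subscribe returns a buffered channel that receives messages published to name.
func (b *broker) subscribe(name string) <-chan string {
	ch := make(chan string, 1)
	b.mu.Lock()
	b.subs[name] = append(b.subs[name], ch)
	b.mu.Unlock()
	return ch
}

// publish delivers msg to the current subscribers of name and returns how many
// received it, mirroring the receiver count that Redis PUBLISH reports.
func (b *broker) publish(name, msg string) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	delivered := 0
	for _, ch := range b.subs[name] {
		select {
		case ch <- msg:
			delivered++
		default: // subscriber buffer is full; drop, as Pub/Sub offers no queueing
		}
	}
	return delivered
}

func deviceChannel(deviceID int) string    { return fmt.Sprintf("device:%d", deviceID) }
func responseChannel(requestID int) string { return fmt.Sprintf("deviceAction:%d", requestID) }

// sendToDevice subscribes to the response channel before publishing the
// request, so that a fast reply cannot arrive before anyone is listening.
func sendToDevice(b *broker, deviceID, requestID int, method string) (<-chan string, error) {
	replies := b.subscribe(responseChannel(requestID))
	req := fmt.Sprintf(`{"id": %d, "method": %q, "jsonrpc": "2.0"}`, requestID, method)
	if b.publish(deviceChannel(deviceID), req) == 0 {
		return nil, fmt.Errorf("no server holds a connection to device %d", deviceID)
	}
	return replies, nil
}

func main() {
	b := newBroker()

	// The server holding device 42's WebSocket listens on the device channel.
	inbox := b.subscribe(deviceChannel(42))

	// The server handling the POST request routes the JSON-RPC request.
	replies, err := sendToDevice(b, 42, 1234, "update")
	if err != nil {
		panic(err)
	}
	fmt.Println("forward to device:", <-inbox)

	// Later, the device's reply is published to the response channel.
	b.publish(responseChannel(1234), `{"id": 1234, "params": {"status": 200}, "jsonrpc": "2.0"}`)
	fmt.Println("reply:", <-replies)
}
```

Because Pub/Sub does not store messages for absent subscribers, a receiver count of zero is a cheap signal that no server currently holds a connection to the device.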

When device 42 receives the JSON-RPC update request, it parses and executes the command, initiating a firmware update on the device, then returns a JSON-RPC response containing a successful 200 status code to the responsible backend instance via WebSocket:

{
  "id": 1234,
  "params": {
    "status": 200
  },
  "jsonrpc": "2.0"
}

This backend instance then publishes this response to the deviceAction:1234 channel, which the server that handled the initial POST request is subscribed to. Finally, the status code and any additional information are used to construct an HTTP response, and the user is informed whether the update was successfully executed or not.
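
The last hop, turning the device's reply into an HTTP status for the browser, might look something like the sketch below. The timeout, the specific HTTP status codes chosen for failures, and the names `deviceResponse`, `awaitResponse` and `replyTimeout` are assumptions made for illustration; the post does not describe how missing or failed replies are handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// replyTimeout is a hypothetical upper bound on how long to wait for a device.
const replyTimeout = 2 * time.Second

// deviceResponse mirrors the response format shown above.
type deviceResponse struct {
	ID     int `json:"id"`
	Params struct {
		Status int `json:"status"`
	} `json:"params"`
}

// awaitResponse waits for the device's reply on the response channel and
// returns the HTTP status to send back to the user's browser.
func awaitResponse(replies <-chan string, requestID int, timeout time.Duration) int {
	select {
	case raw := <-replies:
		var resp deviceResponse
		if err := json.Unmarshal([]byte(raw), &resp); err != nil {
			return http.StatusBadGateway // reply was not valid JSON
		}
		if resp.ID != requestID {
			return http.StatusBadGateway // reply does not match the request
		}
		if resp.Params.Status != http.StatusOK {
			return http.StatusBadGateway // device reported a failure
		}
		return http.StatusOK
	case <-time.After(timeout):
		return http.StatusGatewayTimeout // device never answered
	}
}

func main() {
	replies := make(chan string, 1)
	replies <- `{"id": 1234, "params": {"status": 200}, "jsonrpc": "2.0"}`
	fmt.Println("with reply:", awaitResponse(replies, 1234, replyTimeout))
	fmt.Println("without reply:", awaitResponse(replies, 1234, 10*time.Millisecond))
}
```

Checking the id against the request is redundant when each response channel is named after its request id, but it guards against a message being published to the wrong channel.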

On Scaling Horizontally

When we first released Airtame Cloud, a single backend server handled all of our traffic, including managing device WebSocket connections. As our user base grew, the server struggled to keep up, so we decided to scale horizontally by deploying additional identical instances of our backend.

Since we were already using nginx as a reverse proxy in front of our single server, it was straightforward to set up round-robin routing between multiple servers. However, this introduced the need to re-route requests for specific devices to the servers maintaining WebSocket connections with those devices. We were pleased to discover that Redis Pub/Sub fit our use case quite naturally, and we’ve found that by using the JSON-RPC message id to identify the Pub/Sub response channel, reasoning about the flow of messages through the system is straightforward.

Now we can seamlessly scale our backend up or down as needed without changing any code.

Final Thought

It’s worth noting that gRPC, which had its first GA release in August 2016, could serve as an interesting alternative to our JSON-RPC-over-WebSockets approach. The use of HTTP/2 for transport and Protocol Buffers for serialization appears very promising for efficient bidirectional communication.

One potential drawback of gRPC in our case is that since HTTP/2 is stateless, we’d need to implement our own session management. Additionally, we really value the simplicity and readability of JSON-RPC. Nonetheless, the gRPC project looks very promising, especially with regard to performance.

If anyone has implemented a similar system using gRPC, we’d love to hear about your experience!

We’re always on the lookout for talented engineers who enjoy tackling challenging problems and are passionate about writing clean, maintainable code. If this sounds like you, check out our open positions and get in touch!


Scalable remote management of embedded Linux devices via WebSockets was originally published in Airtame Engineering on Medium.

Source: Airtame