AWS FreeRTOS is a real-time operating system designed to run on IoT devices to enable them to interact easily and reliably with AWS services. The Over the Air (OTA) update functionality makes it possible to update a device with security fixes quickly and reliably. The OTA Library, a part of the overall OTA functionality that runs on the IoT devices, enables customers to learn of available updates, download the updates, check their cryptographic signatures, and apply them.
OTA system architecture
The figure below shows a slightly simplified end-to-end architecture of the OTA system. On the AWS side we have a Job server that tracks the states of IoT devices in the field. That is, it tracks whether a device is active, what version of vendor software it is running, whether it needs an update, and so on. We also have File servers, which are AWS S3 buckets, that hold updates to send to devices. On the IoT device side we have the OTA Library that tracks the device state, handles incoming requests from the Job server, manages transfer of data from the File servers to the devices, applies the update and reports back to the Job server. The OTA App is the developer’s application code and the OTA PAL is the interface between the OTA Library and device-specific hardware. Communication among the devices and servers happens over MQTT and/or HTTP networks.

The OTA system is a complex piece of software that performs firmware updates reliably and securely --- keeping all devices in a consistent state --- in the presence of arbitrary failures of devices and communication. The heart of the OTA system is an intricate distributed protocol, the OTA protocol, that co-ordinates the execution of the different agents involved.
OTA protocol validation
Apart from unit tests for individual agents like Job server or OTA Library, we have written extensive end to end tests for the whole system. Unlike a sequential system where we more or less know the flow of control through the code, a distributed system has multiple asynchronously executing agents -- like hardware device, network, AWS services, and so on for the OTA system -- leading to scenarios which are unanticipated. In fact, there are so many possible interleavings of individual agents’ executions that it is almost impossible to think through all of them.
This is a well-known problem in validating distributed systems and at AWS we have been using formal methods, in particular model checking, to raise the quality bar on distributed systems. As early as 2011, several distributed protocols at AWS were subjected to model checking [1].
Given the centrality of OTA protocol to FreeRTOS, to achieve higher code quality than is possible with standard testing, we decided to use formal methods to validate the OTA protocol. In particular, we used model checking to thoroughly explore a model of the OTA protocol and uncover tricky corner-case scenarios.
Model checking
Formal methods make use of mathematical logic and search techniques to reason about program correctness. Model checking is a notable technique in this area that, starting from a program description and a correctness property, exhaustively explores all the possible behaviors of the program, checking along the way if each such behavior satisfies the property. Any violation is reported back to the user along with a trace leading to the failure, thus giving the user precise and actionable feedback. As systems that have been model checked tend to be of higher quality and have significantly fewer bugs, it is used in both hardware industry and software industry for verifying critical designs [2][3].
Typically, distributed systems tend to be too large to verify directly using model checking. The standard practice is to create a model of the system that captures the core behavior of system and analyze that for protocol level errors. This approach has been successfully used both in hardware and software industry in designing intricate distributed systems [1][5]. Several model checkers --- including TLA+, SPIN, P --- have proven suitable for this task. We used P [4].
OTA protocol modeling
The first step was to extract from the implementation a model of the protocol that is small enough to be amenable to formal analysis and rich enough to exhibit behaviors that mirror real scenarios. Our goal was to capture the core protocol in detail and analyze precisely those aspects of the OTA system that are hardest to test.
To this end we retained the protocol-specific details, such as the types of the messages the agents send and the protocol state machines of the various agents, and abstracted away all details not relevant to the core protocol. The level of detail in a distributed system can be gauged by the amount detail that messages carry. As an indication of the difference in detail, the actual job documents that are sent by the Job server to OTA Library is an UTF-8 encoded JSON containing about 20 fields. In our model, we simplified the job information to contain only 4 simplified fields, namely job id, job status, update version, and flag indicating http use, as shown below:
enum JobStatus {
EMPTY = 0,
PENDING = 1,
INPROGRESS = 2,
SELFTEST = 3
}
type job = (
id: int,
status: JobStatus,
version: int,
http: bool
);