Waylay Data Ingestion Layer
Waylay is best known for its automation platform, which combines the power of its patented rules engine with a unique digital twin concept. Nevertheless, in this blog post I would like to cover a lesser-known, but no less impressive, piece of Waylay's technology that goes by the name "Data Ingestion Layer".
The task of the Data Ingestion Layer is to connect, collect, store and forward data into the Waylay automation/ML stack. Although this sounds like a rather trivial task, let us look at the challenges resolved by the set of data microservices highlighted by the red box in the picture below (channels, converters, webscripts, etc.).
Why is this a more complicated problem than it sounds? If Waylay were simply an offline ETL analytics tool, then "ETL/ML" pipelines (often referred to as MLOps) would be the only thing required. But since Waylay aims to serve both analytics use cases and stream processing use cases, where real time and near real time automation is extremely important, the ETL philosophy of normalizing and aggregating data in batches is simply not good enough.
Stream processing engines can deal with data streams, but they assume that protocol/payload bridging is done upfront: that is someone else's problem. Stream processing tools also can't be combined with external systems, such as digital twin configurations or external CRM systems, in the context of rule definition and execution, as all input must be in the data stream itself.
Therefore, you will often find solutions that deal with only one part of the problem, either stream analytics or offline analytics (extract, transform, store, report), or that, when combined, as in the case of Kafka consumers, create a rather complicated and hard-to-maintain infrastructure.
We also sometimes find tools that deal with real time data for only one protocol at a time, like Node-RED, where, for instance, "inject" nodes terminate MQTT. As a result, you can't apply the same rule to LoRa sensors for exactly the same use case, since that data would arrive via webhooks: the same template would not be applicable, because the protocol stack and payload transformation are hard coded within the rule.
When we look at AWS-based solutions, customers often go for device connectivity using hosted MQTT servers. In that case, if automation requires integration with external systems, or with other IoT systems such as LPWAN, data can arrive via different paths: event bridges, lambdas, data logs, etc.
Bringing all these data sources together in one place at the same time, in order to do even simple stream processing tasks, then becomes a huge challenge. Lambda event-based processing or a stream rules engine only adds to the complexity, even more so when combined with AWS Step Functions or offline analytics triggered by S3 object events.
Waylay's philosophy is that streams and all other data sources should be intercepted and normalized as early as possible, so that no other microservice within the Waylay platform is affected by different protocols and message formats. So, let us look at the biggest challenges that the Data Ingestion Layer addresses:
- Discovery - which identities are we collecting data for?
- Payload transformation
- How data is collected
a. Is data coming as a stream?
b. Is data coming via webhooks?
c. Is data collection happening by polling external endpoints?
- Providing data for further processing
a. In context of real time automation
b. Offline analytics (often used for planning and reporting)
c. ML processing (real time and near real time)
Tapping into 3rd party IoT Platforms, MQTT or Pub/Subs
When devices are already connected via MQTT or third-party IoT platforms, Waylay channels (a pub/sub framework) tap into the 3rd party stream as a consumer on that bus. Waylay channels are composed of two different services: one that handles subscription and data consumption, optionally followed by an additional payload conversion service, should that be necessary.
In that case Waylay holds credentials to the 3rd party platforms, hence it is paramount that these connection settings are stored in a secure way. In Waylay's case this is provided by the Vault service, which encrypts these values at rest; at runtime they are provided to the underlying services, in this case the connectors.
When tapping into other data buses, the first challenge is to discover the identities of the entities within a data stream. That might sound strange at first, as identities are always defined by IoT platforms or embedded devices, not something defined elsewhere. Still, where exactly this identity is encoded within a data stream is rarely known upfront.
For instance, if we have an edge device connected to the MQTT broker to which Waylay is subscribing, then one payload might hold data for more than one entity. In another example, if the payload is for one device only, we don't know upfront where the identity of this device resides: is it encoded in the payload (and in which field), or is it in the topic, or part of the topic name?
In order to deal with this challenge, Waylay has come up with two different services, each with pros and cons:
- a webscript (lambda filter) that can be attached to every incoming message.
- a payload converter, based on Velocity templates, which intercepts, converts and forwards messages towards the Broker.
Even though webscripts are more flexible and easier to set up and test, please keep in mind the function fan-out for every incoming message, which might impact other systems. Such a solution is also more expensive compared to the Velocity-based one.
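To make the discovery problem concrete, here is a minimal sketch of the kind of identity-discovery logic a webscript (lambda filter) might apply to each incoming message. The topic layout (`devices/<id>/…`) and the candidate payload fields (`deviceId`, `id`, `serial`) are purely illustrative assumptions; the actual locations depend on the 3rd party platform feeding the stream.

```python
import json
from typing import Optional

def discover_identity(topic: str, payload: bytes) -> Optional[str]:
    """Try to locate the device identity in the topic first, then in the payload."""
    # Case 1: identity encoded in the topic, e.g. "devices/<id>/telemetry"
    parts = topic.split("/")
    if len(parts) >= 3 and parts[0] == "devices":
        return parts[1]
    # Case 2: identity encoded somewhere in the payload itself
    data = json.loads(payload)
    for field in ("deviceId", "id", "serial"):
        if field in data:
            return str(data[field])
    # Unknown identity: the message can be dropped or routed for inspection
    return None
```

A Velocity-based converter would express the same lookup declaratively in a template rather than in code, at a lower per-message cost.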
Data integration via REST or Websockets
In some cases, data collection and payload transformation are already in place at 3rd party providers. Instead of Waylay being responsible for subscribing to a 3rd party bus, which is essentially what a channel does, customers can also create clients within their own environment and use either the (HTTPS) REST or WebSocket endpoints of the Broker to push data towards Waylay. Another difference from the channel configuration is that in this case, the customer holds Waylay API credentials in its own environment.
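A customer-side client in this setup can be very small. The sketch below pushes one measurement to the Broker over REST; the host name, endpoint path, payload shape and basic-auth scheme are illustrative assumptions, so check the Broker API documentation for the exact contract.

```python
import base64
import json
import urllib.request

BROKER_URL = "https://broker.example-waylay-host.io"  # hypothetical host

def build_event(resource: str, metrics: dict) -> tuple:
    """Return the (url, body) pair for a single-resource data push."""
    url = f"{BROKER_URL}/resources/{resource}/events"
    body = json.dumps(metrics).encode("utf-8")
    return url, body

def push_event(resource: str, metrics: dict, api_key: str, api_secret: str) -> None:
    """POST one measurement, authenticated with customer-held Waylay API credentials."""
    url, body = build_event(resource, metrics)
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    urllib.request.urlopen(req)
```

Note that the credentials live in the customer's environment here, which is exactly the inversion of the channel setup described above.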
We will discuss LPWAN in the context of stream processing, since the discovery and payload transformation pattern is similar. The main difference compared to stream processing is that we don't use a pub/sub channel. Also, in this case a 3rd party system has to hold part of the Waylay credentials (only those related to the data path, not Waylay API keys).
Data throughput is often low compared to the stream processing use case, making webscript integration an economically viable option, as the function fan-out is not a problem. When data comes from LPWAN devices such as Sigfox, LoRa or NB-IoT, the backends (e.g. the Sigfox backend or private/public LoRa servers) forward data to Waylay webscripts.
In this case, our customers first create webscripts in Waylay, which are then configured in these backends (as external URLs). Each webscript is secured by the HMAC-SHA256 algorithm, and in case additional bitmap decoding is required, this can be achieved either by decoding in the same webscript (via a private NPM package) or by delegating decoding to transformers.
Collecting data via external API endpoints
When collecting data over API endpoints, the Waylay system is responsible for data collection. In this case the Data Ingestion Layer is not in use, or at least not for identity discovery and payload conversion; only the Broker service is (for data storage and data stream injection). This is managed within the automation service as a collection task that is responsible for device discovery, payload conversion, and store-and-forward.
External endpoints might already have data stored (if data collection is in place), or they take a measurement at the moment the API is called (such as a weather forecast). In either case the pattern is always similar: the Waylay system periodically calls these endpoints and stores and forwards the data within the Waylay realm. In some cases, Waylay doesn't even need to store or forward the data, only to use it in the context of automation within the same rule. In most cases, the collection template looks like the following:
Collection tasks mostly run periodically, for instance every 15 minutes. What is important is that these collection tasks are spread over time, so that the fan-out to the external endpoints doesn't happen all at the same moment. If that is not a concern, these tasks can as well be executed using cron expressions (e.g. exactly on the hour). In most cases, external endpoints allow us to specify the window for which data is stored in the other system, making polling tasks still a preferred choice for collecting and storing data, since Waylay makes sure that these tasks are equally spread over time.
If the collection node (the first one in this picture) still holds data for multiple identities after the API call, or if the result of the API call is still very complex, then we can pass the result of that sensor to the next function (in this case the transformer node), which might do additional calculations or create a payload in which these different identities are discovered and adjusted. That finally leads us to the third stage, where the template designer decides whether to only store data or to also forward it to the processing automation tasks.
An interesting thing to note is that here, unlike when using the channel framework, the discovery phase and device identities must be known upfront, as the task needs to know which exact API endpoint to call. We often combine this feature with resource types, which means that for each family of devices we set up a different collection task, which is then automatically initiated once a resource becomes part of that resource type set. Authentication is often achieved by either API keys or an OAuth2 flow, which are stored in our Vault and only provided to the collection tasks at runtime.
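The resource-type mechanism can be pictured as a mapping from device family to collection-task configuration; when a resource joins a type, it picks up that family's task automatically. The type names, endpoints and config fields below are hypothetical, chosen only to illustrate the pattern.

```python
from typing import Optional

# Hypothetical per-family collection-task templates
COLLECTION_TASKS = {
    "weather-station": {"endpoint": "https://api.example.com/observations",
                        "period_minutes": 15},
    "smart-meter": {"endpoint": "https://api.example.com/readings",
                    "period_minutes": 60},
}

def task_for_resource(resource_id: str, resource_type: str) -> Optional[dict]:
    """Return the collection-task config a new resource should be attached to."""
    template = COLLECTION_TASKS.get(resource_type)
    if template is None:
        return None  # no collection task defined for this family
    # Secrets (API keys / OAuth2 credentials) are resolved at runtime from the
    # Vault; they are deliberately not part of the task definition itself.
    return {"resource": resource_id, **template}
```

Adding a new device family then only means adding one template, not touching existing tasks.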
What does all of that mean in practice?
Imagine an HVAC unit that sends temperature, pressure, vibration and other telemetry data over MQTT. The room in which the HVAC is placed might also be equipped with door/window sensors connected over LoRa, which send data via webhooks. When any of the HVAC parameters shows unusual behavior, the system should send a notification, such as an e-mail or SMS, or create a Salesforce ticket.
But what if it was just too hot outside and someone forgot to close the window, while keeping the air conditioning running over the weekend? In that case the machine is fine; there is no need to bring in technical support, only for someone to go there and close the window. You often see technically similar use cases in track and trace.
For instance, consider a truck handling the transport of COVID vaccines, where the vehicle telemetry data is gathered and sent over GPRS, while refrigerator data or door open/close events are sent over a LoRa or Sigfox network.
Or instances where your company has more than one MQTT server, or multiple integration points with different IoT solutions. In all these cases, the last thing you want is to deal with these challenges at the automation phase, since every new extension point would require a complete rework of your automation tools and business processes.
In the example below, you can see how easy it is to create such automation scenarios, which are designed once and then applied to any new IoT system that you bring into the mix.