Many enterprises are investing in their next-generation data management platforms in the hope of democratizing data at scale, providing business insights, and ultimately making automated, intelligent decisions.
Use cases powered by artificial intelligence are defining these next-generation data platforms. As early adopters of these technologies, we’ve gone down plenty of interesting routes, and of course dead ends, on the way to a data infrastructure setup that is flexible enough to handle AI-powered workloads.
We’ve mostly drawn inspiration from the public clouds, yet we wanted to create an architecture that is usable by enterprises that cannot move their data to the cloud, at least for now. Should that decision come, we should be able to move the solution easily or, better yet, extend it with cloud computing resources.
The design philosophy is very simple. At a high level, an orchestration layer sits on top of a set of optimized storage layers. A key design decision was to separate storage from orchestration while still keeping both layers horizontally distributed. That simply means a job or model can run on any machine and should always have access to the storage that holds the data it needs. Another conscious decision was to launch smaller applications with minimal responsibilities, which limits the impact when something isn’t working.
For orchestration, we opt for Kubernetes, currently the most popular container orchestration platform available, or for similar technologies such as OpenShift. Kubernetes is designed to automate application deployment, scaling, and management. It pools CPU and memory across multiple servers or instances and allocates them to applications, regardless of the workload. The data itself can remain on existing storage layers, accessible to any application launched from the orchestration layer. Certain storage systems have cloud-native support, which makes them suitable for running on Kubernetes. We, for instance, use such storage for caching, because it can sit very close to the application and thus reduce latency.
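The idea of pooling CPU and memory across servers can be illustrated with a toy placement routine. This is a minimal sketch and not how the Kubernetes scheduler actually works: the `Node` class, the `schedule` function, the first-fit strategy, and all node and application names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical server contributing resources to the shared pool."""
    name: str
    cpu: float         # free CPU cores
    memory_gb: float   # free memory in GB
    apps: list = field(default_factory=list)


def schedule(nodes, app_name, cpu, memory_gb):
    """Place an app on the first node with enough free CPU and memory.

    Returns the chosen node name, or None if no node has capacity.
    """
    for node in nodes:
        if node.cpu >= cpu and node.memory_gb >= memory_gb:
            node.cpu -= cpu
            node.memory_gb -= memory_gb
            node.apps.append(app_name)
            return node.name
    return None  # no capacity left: the app would stay pending


nodes = [
    Node("node-a", cpu=4, memory_gb=16),
    Node("node-b", cpu=8, memory_gb=32),
]

# The caller only states what the app needs, not where it should run.
placements = {
    "etl-job": schedule(nodes, "etl-job", cpu=3, memory_gb=8),
    "model-training": schedule(nodes, "model-training", cpu=6, memory_gb=24),
    "dashboard": schedule(nodes, "dashboard", cpu=1, memory_gb=2),
    "big-batch": schedule(nodes, "big-batch", cpu=4, memory_gb=8),
}
print(placements)
```

In this sketch the last request finds no node with four free cores, so it is left unplaced until resources are released. The point is only that applications ask for resources and the orchestration layer decides where they run.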
Orchestrating containers provides great flexibility when running data workloads. You can, for instance, use a different configuration for working and non-working hours: at night, underutilized or idle resources can be used by nightly builds or batch processing and then released when users need them again.
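A day and night configuration of this kind could be expressed as a simple lookup from the time of day to a resource profile. The sketch below is illustrative only: the working hours, the profile names, the workload names, and the replica counts are all assumed values, and weekends are ignored.

```python
from datetime import time

# Assumed working hours; a real setup would take these from configuration.
WORK_START = time(8, 0)
WORK_END = time(18, 0)

# Hypothetical replica counts per workload for each profile.
PROFILES = {
    "working_hours": {"user_notebooks": 10, "batch_workers": 2},
    "off_hours": {"user_notebooks": 1, "batch_workers": 12},
}


def profile_for(now):
    """Return the name of the profile that applies at the given time of day."""
    if WORK_START <= now < WORK_END:
        return "working_hours"
    return "off_hours"


def desired_replicas(now):
    """Return the replica count per workload for the given time of day."""
    return PROFILES[profile_for(now)]


print(profile_for(time(9, 30)), desired_replicas(time(9, 30)))
print(profile_for(time(23, 0)), desired_replicas(time(23, 0)))
```

A scheduled job could evaluate such a function periodically and scale the workloads accordingly, so batch processing gets most of the cluster at night and hands it back in the morning.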
Kubernetes also provides ways to recover from failures automatically. If one of three web application instances fails, Kubernetes launches a fourth instance to keep meeting the demand on the service and handles the failed instance (a pod, in Kubernetes terminology) according to the configured policies.
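The recovery behavior can be pictured as a loop that compares the desired number of instances with the ones actually running. The following is a simplified model, not Kubernetes code: the `reconcile` function, the pod names, and the status strings are assumptions, and the failed pod is simply dropped here, whereas a real cluster handles it according to its policies.

```python
import itertools


def reconcile(pods, desired, new_names):
    """Drop pods that are not running and add replacements up to `desired`.

    `pods` maps pod name to status; `new_names` yields names for new pods.
    """
    healthy = {name: status for name, status in pods.items() if status == "Running"}
    while len(healthy) < desired:
        healthy[next(new_names)] = "Running"
    return healthy


# Hypothetical naming scheme: replacements continue from web-4.
names = (f"web-{i}" for i in itertools.count(4))

pods = {"web-1": "Running", "web-2": "Failed", "web-3": "Running"}
pods = reconcile(pods, desired=3, new_names=names)
print(pods)
```

After one pass the service is back to three running instances, with `web-4` standing in for the failed `web-2`.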
For the storage layers, there are a number of options. We tend to choose the one that best solves the problem and provides the right customer experience. A popular choice among enterprises is to connect the orchestration layer to an existing data lake solution such as Cloudera Data Platform. Gauss is a certified Cloudera partner for the CEE region. Our architecture, in combination with Cloudera’s solution, runs, for instance, at T-Mobile CZ & SK and at a top Slovak bank.
When customers trust us to modernize their data infrastructure, we focus strongly on the experience we want to provide to users, and we choose the technology best suited to deliver it. As a result, this is what customers are saying:
“In ad hoc reports, where analysis over a large history of data is required, the acceleration can be from several hours over Oracle DWH to a few seconds or minutes, depending on the complexity of the query, over Apache Impala,” Jana Šipická, Tatra banka Slovakia.