2TB of new data is added to the system daily
As a subsidiary of the 3rd largest telco operator in the world (Deutsche Telekom), T-Mobile (regionally: Slovak Telecom) processes an enormous amount of data every second. Their challenge was that they were ready to start incorporating big data technologies into their operations, but for several years prior they had been concerned that the implementation costs would not deliver a positive ROI.
As data technologies advanced, the costs fell, and they eventually sought our help in designing their data architecture. They still wanted to be conservative in their investment, so they asked us to set things up so that the platform could be managed by a relatively small team of engineers.
We chose an architecture capable of processing both batch and real-time data without requiring extensive engineering work to enable data flows and process data. Efficiency, cost management, and data processing speed are all critical for a company of this size. For their needs, we decided to implement the Cloudera Data Platform (Gauss Algorithmic was the first company to become a fully certified Cloudera partner in the Czech Republic).
Within the architecture, we implemented the Cloudera distribution of Hadoop with Apache Kafka and Apache Spark. Tools like Apache Flume and Apache Sqoop were initially selected to reduce the data engineering effort required to process data, whereas today the engineers have adopted Apache Spark for better performance.
The system runs as multiple logical clusters that are independently scalable. Data integration is handled by two clusters with Apache Kafka as the core open-source technology. They work mainly as a protection layer, reducing the amount of “dirty” data flowing into the storage layers. We run these integration layers on the operator's private cloud infrastructure, allowing us to add resources much faster and absorb unexpected spikes in the data flows.
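To illustrate what such a protection layer does, here is a minimal sketch of record validation applied between consuming from Kafka and writing to storage. The field names and validation rules are illustrative assumptions, not the operator's actual schema.

```python
# Hypothetical sketch of a "protection layer" validation step: records consumed
# from an integration cluster are checked before reaching the storage layer.
# The schema (event_id, timestamp, payload) is an assumption for illustration.

def is_clean(record: dict) -> bool:
    """Reject malformed or incomplete ("dirty") records."""
    required = {"event_id", "timestamp", "payload"}
    if not required.issubset(record):
        return False
    if not str(record["event_id"]).strip():
        return False
    # Timestamps must be positive epoch seconds
    if not isinstance(record["timestamp"], (int, float)) or record["timestamp"] <= 0:
        return False
    return True

def filter_batch(records):
    """Split a consumed batch into clean records and rejects."""
    clean, dirty = [], []
    for r in records:
        (clean if is_clean(r) else dirty).append(r)
    return clean, dirty

batch = [
    {"event_id": "a1", "timestamp": 1700000000, "payload": {"kb": 512}},
    {"event_id": "", "timestamp": 1700000001, "payload": {}},      # empty id
    {"event_id": "a2", "timestamp": -5, "payload": {}},            # bad timestamp
]
clean, dirty = filter_batch(batch)
```

In a real deployment this logic would run inside the streaming pipeline itself, so only the `clean` records ever reach the downstream storage layers.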
The data lake itself takes care of further processing and analytics, including machine learning. This cluster runs on bare-metal hardware for better performance. Resource management is fine-tuned to the needs of the different teams accessing the platform, such as IT or data scientists. Python notebooks and SQL-based access methods are available to these end users.
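As a sketch of the SQL-based access end users have, the query below shows the kind of aggregation an analyst might run against the data lake. SQLite stands in here for the platform's actual SQL engine, and the table and column names are hypothetical.

```python
# Illustrative only: SQLite in memory plays the role of the data lake's SQL
# interface. The daily_traffic table and its columns are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_traffic (region TEXT, gb_processed REAL)")
conn.executemany(
    "INSERT INTO daily_traffic VALUES (?, ?)",
    [("west", 800.0), ("east", 700.0), ("west", 500.0)],
)

# A typical analyst query: total volume per region
rows = conn.execute(
    "SELECT region, SUM(gb_processed) FROM daily_traffic"
    " GROUP BY region ORDER BY region"
).fetchall()
# rows -> [("east", 700.0), ("west", 1300.0)]
```

The same query shape works unchanged against the notebook and SQL endpoints the platform exposes, which is what lets non-engineering teams self-serve analytics.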
The solution complies with the required security and data protection levels. Overall, the system today runs critical workloads, accessing more than 10 internal and external data sources, and informs marketing and operational decisions on a daily basis.

Big data solutions are not one-size-fits-all. We delivered a much higher ROI on their technology investment simply by listening to their exact needs and pain points, and by implementing a solution that met their specifications while remaining affordable and operable by a smaller team of engineers.