Big Data Operations with Cloudera

A practical workshop on bringing a modern data platform into your enterprise environment. The workshop is based on our experience delivering enterprise-grade data solutions. During the lifecycle of any Big Data system, data engineering teams face requests and requirements from the departments responsible for data quality, monitoring and security. Knowing best practices drawn from real-world scenarios will help attendees be ready from day one and ease their path in managing these complex environments.


29. 11. 2017 to 30. 11. 2017

The workshop is 40% theory and 60% hands-on. Each participant will create their own cluster setup within a provided cloud environment and submit completed work through their GitHub account. The advantage of this approach is that the workshop tasks can be replicated later in the same cloud environment.

Why attend

  • Your company has decided to adopt big data or machine learning and has asked you to figure out the implementation.

  • A consulting company has left you with an expensive Hadoop cluster that is neither doing anything nor being managed in any way.

  • You’ve opted for open source and realized it’s not that easy to meet enterprise demands.

  • You plan to use software in the Cloudera stack for your end solution.

Required knowledge

  • Basic understanding of the Big Data landscape

  • Basic Linux

  • SQL

  • Advanced English (the workshop is held primarily in English)

What you need for the workshop

  • Laptop with working Wi-Fi

  • GitHub account (private or public)

Workshop outline


  • Big Data and Hadoop
    A brief reminder of why these systems exist and what the benefits of adopting these technologies are.

  • Why use Gauss and Cloudera
    Understanding the pros and cons of adopting open source software and how companies like Gauss Algorithmic and Cloudera can help.

Big Data Projects

  • Planning
    A discussion of what you need to get a Big Data project running.
  • Deployment options
    The advantages of running Big Data environments on-premises, in a hybrid setup or in the cloud.
  • Sizing
    How to figure out the right cluster size for your project and how to predict scale.
  • Security and compliance
    A closer look at architectural designs with security in mind, covering topics like authorization, authentication, transparency and encryption.
  • Best practices
    Discussing people and skills, data governance, external monitoring and high availability.
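
As a rough illustration of the sizing topic above, the following sketch estimates how many worker nodes are needed for storage alone. The function name and every input figure (data volume, growth rate, headroom, disk per node) are hypothetical assumptions chosen for the example; only the replication factor of 3 reflects the HDFS default. Real sizing also has to account for CPU, memory and workload patterns, which this sketch ignores.

```python
import math


def estimate_worker_nodes(raw_tb, yearly_growth, years, disk_per_node_tb,
                          replication=3, headroom=0.25):
    """Estimate worker nodes needed for storage only (rule-of-thumb sketch).

    raw_tb           -- current raw data volume in TB (assumed input)
    yearly_growth    -- expected growth per year, e.g. 0.5 for 50 %
    years            -- planning horizon in years
    disk_per_node_tb -- raw disk capacity per worker node in TB
    replication      -- HDFS replication factor (the HDFS default is 3)
    headroom         -- fraction of disk kept free for temporary data (assumption)
    """
    # Project the data volume forward with compound growth.
    projected_tb = raw_tb * (1 + yearly_growth) ** years
    # Every block is stored `replication` times.
    replicated_tb = projected_tb * replication
    # Only part of each node's disk is treated as usable for data.
    usable_per_node_tb = disk_per_node_tb * (1 - headroom)
    return math.ceil(replicated_tb / usable_per_node_tb)


if __name__ == "__main__":
    # Hypothetical project: 100 TB today, 50 % yearly growth, 2-year horizon.
    nodes = estimate_worker_nodes(raw_tb=100, yearly_growth=0.5, years=2,
                                  disk_per_node_tb=48)
    print(f"Estimated worker nodes: {nodes}")
```

With these example figures, 100 TB grows to 225 TB, which becomes 675 TB after replication; at 36 TB usable per node this rounds up to 19 nodes.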

LABS: Building a cluster in the cloud
In this lab, each participant will create a cluster running the latest Cloudera CDH release, using Cloudera Director in a public cloud environment.

Data Integration

  • Connecting to the existing enterprise ecosystem
    The best methods for collecting files, exporting data from RDBMS systems and connecting to cloud storage.
  • Real-time streaming
    Discussing the options available for real-time streaming and the things to consider during implementation.
  • Best practices

Storing Data

  • Hadoop Distributed File System
    A closer look at the distributed file system and when it’s best to use it. We’ll also cover some basic operational tasks and health checks.
  • New storage technologies (Apache Kudu)
    Kudu is a newer storage technology, developed to tackle use cases in the IoT domain in particular. We’ll introduce this technology and compare it with other storage options.
  • Apache Kafka
    Kafka is the go-to technology for real-time data processing. Though its functionality is rather simple, it can be a challenge to understand what’s really happening inside. This discussion covers basic setup, operation and monitoring.


Real-time ingestion
Creating a basic real-time pipeline with Flume, Kafka and Kudu.

Basic operations on HDFS
A look at the most common operations on the Hadoop Distributed File System: moving data around, setting file permissions, removing data and working with the trash.
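
To give a flavour of these operations, here is a minimal sketch that assembles a few `hdfs dfs` shell commands from Python. The helper names and all file paths are hypothetical; by default the script only prints the commands (a dry run), so it can be read without a cluster. Running it for real assumes the `hdfs` client is installed and configured.

```python
import subprocess


def hdfs_cmd(*args):
    """Build the argument list for an `hdfs dfs` call."""
    return ["hdfs", "dfs", *args]


# All paths below are hypothetical examples.
COMMANDS = [
    hdfs_cmd("-mkdir", "-p", "/data/landing", "/data/archive"),
    hdfs_cmd("-put", "events.csv", "/data/landing/"),               # upload a local file
    hdfs_cmd("-mv", "/data/landing/events.csv", "/data/archive/"),  # move data around
    hdfs_cmd("-chmod", "640", "/data/archive/events.csv"),          # set file permissions
    hdfs_cmd("-rm", "/data/archive/events.csv"),  # goes to the trash if trash is enabled
    hdfs_cmd("-expunge"),                         # clean up old trash checkpoints
]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(COMMANDS)  # dry run: only prints the commands
```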

Kafka Messaging System
A fun and simple way to play around with Kafka is to create a messaging system in which topics are used as group chat spaces.
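
The idea can be sketched with a toy, in-memory model. This is not the Kafka client API: the `ToyBroker` class and its methods are hypothetical and leave out partitions, consumer groups and persistence. It only illustrates the core concept that a topic is an append-only log and each reader keeps its own offset into it.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only log."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> list of messages
        self._offsets = defaultdict(int)  # (user, topic) -> next offset to read

    def send(self, topic, user, text):
        """Append a message to the topic, which acts as the chat room."""
        self._logs[topic].append(f"{user}: {text}")

    def poll(self, topic, user):
        """Return messages the user has not read yet and advance their offset."""
        start = self._offsets[(user, topic)]
        new_messages = self._logs[topic][start:]
        self._offsets[(user, topic)] = len(self._logs[topic])
        return new_messages


if __name__ == "__main__":
    broker = ToyBroker()
    broker.send("lunch", "anna", "Pizza today?")
    broker.send("lunch", "ben", "Sure, at noon.")
    print(broker.poll("lunch", "carl"))  # carl sees both messages
    print(broker.poll("lunch", "carl"))  # nothing new on the second poll
```

Because every reader tracks its own offset, a participant who joins late can still read the whole conversation from the beginning of the topic.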

Accessing Data

  • SQL (Impala, Hive)
    SQL is still relevant as it offers a well-known and simple way for data analysts to query data. Impala and Hive are among the most popular and commonly used engines in the Big Data world. We’ll take a closer look at these engines and see where they work best.
  • Apache Spark
    Apache Spark needs little introduction as it’s one of the world’s most adopted open source projects. It’s a great framework for large-scale data processing, but it’s wise to understand when and how to use it correctly.
  • Resource Management
    In large enterprises, clusters are very likely to be multi-tenant. They carry diverse workloads, so knowing how to manage resources correctly is vital to keeping end users productive.


Playing with HUE (Hadoop User Experience)
HUE is a basic tool within the Cloudera environment that simplifies access to various data on HDFS and other supported storage spaces. This exercise goes through the basic features.

Connecting BI tools

Preparing for data science

  • Clusters for data science workloads
    Data science, using techniques like machine learning, is opening new innovation paths for enterprises. To get the most out of your data, researchers should be able to access all of it with the freedom to work how they want, when they want. We’ll look at the latest available options.

LABS: Implementing Notebooks (Jupyter, Cloudera Data Science Workbench)

Where: Rohanské nábřeží 671/15, Praha 8

Capacity: 6/10


Johnson Darkwah

Big Data Solution Architect

Gauss Algorithmic

Specialises in designing complex Big Data solutions. Many years of experience in the telco industry. Ice hockey and football player.

Price: 24 000 CZK (excluding VAT)