
Howdy, everyone! In this blog post, we are going to build a customised Cloudera Data Science Workbench (CDSW) engine that includes the Kafka Python client.

Before we proceed with the hands-on part, I would like to shed some light on CDSW for those who are new to this data science platform.

As per Cloudera, Cloudera Data Science Workbench is a secure, self-service enterprise data science platform that lets data scientists manage their own analytics pipelines, thus accelerating machine learning projects from exploration to production.

  • It allows data scientists to bring their existing skills and tools, such as R, Python, and Scala, to securely run computations on data in Hadoop clusters.
  • It enables data science teams to use their preferred data science packages to run experiments with on-demand access to compute resources.
  • Models can be trained, deployed, and managed centrally for increased agility and compliance.

 

So after learning about this amazing data science platform, my first instinct was to use it to build an engine with a Kafka Python client.

In this example, we are using kafka-python, a pure-Python client for Apache Kafka. The latest kafka-python version available at the time of writing is 1.4.6.

As per its documentation, kafka-python is a Python client for the Apache Kafka distributed stream processing system. It is designed to function much like the official Java client, with a sprinkling of Pythonic interfaces (e.g., consumer iterators).
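To give a feel for the "consumer iterator" style mentioned above, here is a minimal sketch, not a full client. The function name `read_messages` and its parameters are my own illustration, and the topic name and broker address you pass in are assumptions about your environment. The import sits inside the function so that nothing touches Kafka until you actually call it.

```python
def read_messages(topic, bootstrap_servers, max_messages=5, timeout_ms=5000):
    """Read up to max_messages record values from a topic (hypothetical helper)."""
    # Imported lazily so that defining this function does not require kafka-python.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        # Start from the oldest retained record when no committed offset exists.
        auto_offset_reset="earliest",
        # Stop iterating if no new record arrives within this many milliseconds.
        consumer_timeout_ms=timeout_ms,
    )
    values = []
    try:
        # The consumer object is itself an iterator over ConsumerRecord objects.
        for record in consumer:
            values.append(record.value)  # raw bytes, since no deserializer is set
            if len(values) >= max_messages:
                break
    finally:
        consumer.close()
    return values
```

Assuming a broker is reachable at a placeholder address such as localhost:9092 and a topic called demo-topic exists, `read_messages("demo-topic", "localhost:9092")` would return a list of up to five raw message values.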

Version Info: CDSW: 1.5.x & kafka-python: 1.4.6

Let’s get our hands dirty.

  1. First, we need to create a new Dockerfile for our custom Kafka CDSW engine. In this file, we specify the kafka-python client package that needs to be installed on top of the base image provided by Cloudera.

    Sample Dockerfile:

    # Dockerfile

    # Specify a Cloudera Data Science Workbench base image
    FROM docker.repository.cloudera.com/cdsw/engine:5

    # Refresh the package index on the base image and install the Kafka client
    RUN apt-get update
    RUN pip3 install kafka-python

     

  2. Now that our Dockerfile is ready, let’s build the new image. The host must have the Docker binaries installed. Run the following command to build the custom Kafka CDSW engine image.

    $ docker build -t kafkapy:latest . -f Dockerfile

     

    Note: For simplicity, I have assumed that you have only one host. If you have multiple hosts, distribute the new image to all your Cloudera Data Science Workbench hosts using the instructions in the Cloudera documentation.

  3. To verify that the custom engine image was created and loaded successfully, run the following command. The kafkapy image, which includes the kafka-python client, should appear in the list.

    $ docker images

     

  4. The final step is to whitelist the image in Cloudera Data Science Workbench for deployment.

    Log in to the CDSW web UI as an admin user >> Click Admin >> Engines >> Add “kafkapy:latest” to the list of whitelisted engine images.

Voila! Now you are ready to experiment with your new custom CDSW engine and play with the kafka-python client.
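A quick sanity check you could run in a Python session on the new engine is sketched below. It only confirms that the package is importable and reports its version; it does not contact any Kafka broker. The helper name `package_version` is my own, not part of CDSW or kafka-python.

```python
import importlib
import importlib.util


def package_version(module_name):
    """Return the module's __version__ string, or None if it is not installed."""
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        return None
    module = importlib.import_module(module_name)
    # Fall back to "unknown" for modules that do not expose __version__.
    return getattr(module, "__version__", "unknown")


# kafka-python is imported under the module name "kafka".
version = package_version("kafka")
if version is None:
    print("kafka-python is not installed in this engine")
else:
    print("kafka-python version:", version)
```

If the session reports that the package is missing, the project is most likely still using the default engine image rather than the whitelisted kafkapy one.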

I will try to put up another blog post explaining how to run a simple Python client to produce to and consume from Kafka topics using CDSW.

If you encounter errors, please go through the steps again and troubleshoot the issue. I will be happy to help you out, so please comment on this blog with your queries.

Ankit
