Welcome to Percona Live Online 2021
Online Open Source Database Conference
Wednesday, May 12 • 06:30 - 07:00
Building and Scaling a Robust Zero-Code Data Pipeline With Open Source Technologies [30min]

With the rapid onset of the global Covid-19 pandemic in 2020, the US Centers for Disease Control and Prevention (CDC) quickly implemented a new Covid-19 pipeline to collect testing data from all US states and territories and to produce multiple consumable results for federal and public agencies. They did this in under 30 days, using Apache Kafka.

We built a similar (but simpler) demonstration pipeline for ingesting, indexing, and visualizing some publicly available tidal data using multiple open source technologies including Apache Kafka, Apache Kafka Connect, Apache Camel Kafka Connectors, Open Distro for Elasticsearch and Kibana, Prometheus and Grafana.

In this talk, we introduce each technology and the pipeline architecture, then walk through the steps, challenges, and solutions involved in building an initial integration pipeline that consumes USA National Oceanic and Atmospheric Administration (NOAA) tidal data, maps and indexes the data types in Elasticsearch, and adds missing data with an ingest pipeline. The goal is to visualize the results with Kibana, where we’ll see the period of the “lunar” day and the size and location of some small and large tidal ranges.
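
To make the mapping and ingest pipeline steps concrete, here is a minimal sketch of the two JSON bodies involved, built as Python dictionaries. The field names (`t`, `v`, `location`), the timestamp format, and the helper names are assumptions for illustration only, not the names used in the talk. The `set` processor with `override` disabled only fills in the field when it is missing.

```python
import json


def build_index_mapping():
    """Explicit Elasticsearch mapping so fields are not guessed as text.

    Field names here are hypothetical placeholders for a tidal reading.
    """
    return {
        "mappings": {
            "properties": {
                "t": {"type": "date", "format": "yyyy-MM-dd HH:mm"},  # reading time
                "v": {"type": "double"},                              # water level
                "location": {"type": "geo_point"},                    # station position
            }
        }
    }


def build_ingest_pipeline(lat, lon):
    """Ingest pipeline that adds a station location when a document lacks one."""
    return {
        "description": "Add a missing station location (illustrative sketch)",
        "processors": [
            {
                "set": {
                    "field": "location",
                    "value": f"{lat},{lon}",  # "lat,lon" string form of a geo_point
                    "override": False,        # leave an existing location untouched
                }
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_index_mapping(), indent=2))
    print(json.dumps(build_ingest_pipeline(37.8, -122.46), indent=2))
```

The mapping body would be sent when creating the index and the pipeline body when registering the ingest pipeline. Typing the location as `geo_point` is what lets Kibana place readings on a map.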

But what can go wrong? The initial pipeline only worked briefly, failing when it encountered exceptions. To make the pipeline more robust, we investigated Apache Kafka Connect exception handling, and evaluated the benefits of using Apache Camel Kafka Connectors, and Elasticsearch schema validation.
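
As an illustration of Kafka Connect exception handling, the sketch below builds a sink connector configuration that tolerates bad records and routes them to a dead letter queue instead of failing the task. The `errors.*` keys are standard Kafka Connect settings. The connector class, topic names, and helper name are hypothetical placeholders, and this is not necessarily the configuration used in the talk.

```python
import json


def build_sink_config(topic, dlq_topic):
    """Return a Kafka Connect sink configuration with error tolerance enabled."""
    return {
        # Hypothetical placeholder; substitute the sink connector actually deployed.
        "connector.class": "com.example.HypotheticalElasticsearchSinkConnector",
        "topics": topic,
        "tasks.max": "1",
        # Keep running when a record fails conversion or transformation.
        "errors.tolerance": "all",
        # Log failures, including the offending record contents.
        "errors.log.enable": "true",
        "errors.log.include.messages": "true",
        # Send failed records to a separate topic for later inspection.
        "errors.deadletterqueue.topic.name": dlq_topic,
        # Attach failure context (such as the exception) as record headers.
        "errors.deadletterqueue.context.headers.enable": "true",
    }


if __name__ == "__main__":
    config = build_sink_config("tides-topic", "tides-dlq")
    print(json.dumps({"name": "tides-sink", "config": config}, indent=2))
```

These settings apply to failures in the conversion and transformation stages. Errors raised while writing to the target system depend on the individual connector, which is one reason to compare connector implementations.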

With a sufficiently robust pipeline in place, it’s time to scale it up. The first step is to select and monitor the most relevant metrics across multiple technologies. We configured Prometheus to collect the metrics, and Kibana to produce a dashboard. With the monitoring in place, we were able to systematically increase the pipeline throughput by increasing the number of Kafka connector tasks while watching out for potential bottlenecks. We discovered, and fixed, two bottlenecks in the pipeline, proving the value of this approach to pipeline scaling.
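
One way to picture a systematic scale-up is as a sequence of `tasks.max` values to try, checking the dashboards after each step. The sketch below assumes a doubling schedule, which is an illustrative choice rather than the procedure from the talk. It relies on the fact that sink connector tasks share a topic's partitions as a consumer group, so tasks beyond the partition count sit idle. Function names are hypothetical.

```python
def effective_sink_tasks(tasks_max, partitions):
    """Tasks that can actually consume: each partition goes to at most one task."""
    return min(tasks_max, partitions)


def scaling_steps(partitions, start=1):
    """Doubling schedule of tasks.max values, capped at the partition count."""
    steps = []
    n = start
    while n < partitions:
        steps.append(n)
        n *= 2
    steps.append(partitions)  # final step uses every partition
    return steps


if __name__ == "__main__":
    for tasks in scaling_steps(12):
        # In practice: apply tasks.max, wait, then compare throughput and lag metrics.
        print(f"tasks.max={tasks} -> active tasks={effective_sink_tasks(tasks, 12)}")
```

If throughput stops improving between steps while tasks are still below the partition count, that suggests a bottleneck elsewhere in the pipeline.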

We conclude the presentation with lessons learned so far, and some potential future challenges.

Paul Brebner

Open Source Technology Evangelist, Instaclustr by NetApp
Open Source Technology Evangelist at Instaclustr by NetApp. For the last 5 years, Paul has been learning new open source technologies, building realistic demonstration applications, writing blogs, and presenting at international conferences including FOSSASIA, All Things Open and...

Wednesday May 12, 2021 06:30 - 07:00 EDT
Room #4