Welcome to Percona Live Online 2021
Online Open Source Database Conference
Wednesday, May 12 • 15:30 - 16:30
Massive Data Processing in Adobe Using Delta Lake

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is the complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated across multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences.

* What are we storing?
* Multi Source - Multi Channel Problem
* Access Pattern to optimize for
* Custom High Performance Query engine
* Data Representation and Nested Schema Evolution
* Performance Trade-Offs with Various Formats
* Go over anti-patterns used
* (String FTW)
* Data Manipulation using UDFs
* Writer Worries and How to Wipe them Away
* Gotchas
* Concurrency
* Column size
* Update frequency
* Transaction Management for A Healthy State
* Staging Tables FTW
* Why we can't live without them
* Datalake Replication Lag Tracking
* Instrumentation of the data pipeline gives more confidence to the reader
* Downstream Data Pipelines
* Showcase easy building of incremental versions of applications
* Maintenance Jobs
* Go over essentials of compaction and vacuuming
* Performance Time!
* What scale are we operating at?
* Settings like autoCompact and optimizeWrite
* Timings With and Without Delta
* Cost

Yeshwanth Vijayakumar

Sr. Engineering Manager/Architect, Adobe Systems Inc
I am a Sr. Engineering Manager/Architect on the Unified Profile Team in the Adobe Experience Platform. It is a PB-scale store with a strong focus on millisecond latencies and analytical capabilities, and easily one of Adobe’s most challenging SaaS projects in terms of scale. I am actively...

Wednesday May 12, 2021 15:30 - 16:30 EDT
Room #6