Imagine you are assigned to extract sales insights from your data. Along with troves of corporate financials and other market trend data, you are also given access to hours of audio and video recordings of actual sales representatives speaking with customers. How do you process this in Spark?
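To make the problem concrete, here is a minimal sketch of what a first attempt might look like in PySpark. The speech-to-text library (openai-whisper) and the s3://bucket/sales-calls/ path are assumptions for illustration; any real pipeline would differ in the details.

```python
# A sketch of one way to attempt this: load the raw audio as binary files
# and run a speech-to-text model inside a UDF. Assumes openai-whisper and
# ffmpeg are installed on every executor.
import tempfile

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("sales-call-transcripts").getOrCreate()

# Spark's binaryFile source yields path, modificationTime, length, and the
# raw bytes (content) for each file.
calls = spark.read.format("binaryFile").load("s3://bucket/sales-calls/*.mp3")

def transcribe(content: bytes) -> str:
    import whisper  # imported per executor; heavyweight, so cache in practice
    model = whisper.load_model("base")
    # whisper's transcribe() expects a file path, so spill bytes to a temp file
    with tempfile.NamedTemporaryFile(suffix=".mp3") as f:
        f.write(content)
        f.flush()
        return model.transcribe(f.name)["text"]

transcribe_udf = udf(transcribe, StringType())

transcripts = calls.select("path", transcribe_udf("content").alias("transcript"))
```

Even this naive version steps outside the standard SQL workflow: the heavy lifting happens in a hand-rolled UDF over raw bytes, not in any off-the-shelf component of the stack.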
Or, consider another scenario where you work for a marketplace and your job is to construct a consumer-facing product catalog. You have a database with hundreds of thousands of SKUs, stock levels, and item descriptions, along with millions of product photo URLs provided by vendors. Some URLs are correct, others are broken, and there is no quality control for the photos whatsoever. Where do you start?
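One plausible first step is simply triaging the links, as in the sketch below. The vendor_photo_urls table and photo_url column are hypothetical names used for illustration.

```python
# A sketch of triaging vendor photo URLs in PySpark: issue an HTTP HEAD
# request per URL and keep only those that resolve to an image.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName("catalog-photo-triage").getOrCreate()

# Assumed table with columns: sku, photo_url
photos = spark.read.table("vendor_photo_urls")

def url_is_image(url: str) -> bool:
    import requests
    try:
        resp = requests.head(url, timeout=5, allow_redirects=True)
        return resp.ok and resp.headers.get("Content-Type", "").startswith("image/")
    except requests.RequestException:
        return False

valid_photos = photos.filter(udf(url_is_image, BooleanType())("photo_url"))
```

Note that this only separates dead links from live ones; judging whether a photo is actually usable in a catalog would require a vision model, and the standard stack has no obvious slot for one.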
Normally, the go-to solution would be the “modern data stack”: some combination of data storage, data transformation, and data ingestion tools, rounded out with various business intelligence tools.
Unfortunately, none of these tools is sufficient for the tasks outlined above.
To understand why, let’s revisit enterprise data processing as it exists today, whose defining characteristic is that it is cloud-based. This was a major upgrade from the local processing mode of the pre-cloud era: we now enjoy effectively unlimited object storage, elastically scalable compute, and a mind-boggling selection of modular data mart components. As it stands, the main features of the modern data stack include: