Data Profiling - It's Everyone's Job to Uncover Data's Underlying Values

Published: July 27, 2021

Everyone profiles. The question is how.

 

Profiling data: examining real data to see its underlying values, structure, legality, patterns, and statistical variations. When you profile data, you look at the data itself rather than relying on what the database says about the data type, or on what people think the values are or should be. Data profiles give you a taste of reality, and reality rarely matches expectations perfectly.
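To make that definition concrete, here is a minimal sketch of what profiling a single column might look like in plain Python. It is only an illustration, not how any particular profiling tool works: the function names `value_pattern` and `profile_column`, the "9 for digit, A for letter" pattern notation, and the sample ZIP code values are all assumptions made up for this example.

```python
from collections import Counter
import re


def value_pattern(value: str) -> str:
    """Generalize a value: each digit becomes 9, each ASCII letter becomes A."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_column(values):
    """Summarize what is actually in a column, ignoring its declared type."""
    # Treat None and blank strings as missing (an assumption for this sketch).
    texts = [str(v).strip() for v in values
             if v is not None and str(v).strip() != ""]
    lengths = [len(t) for t in texts]
    return {
        "rows": len(values),
        "nulls_or_blank": len(values) - len(texts),
        "distinct": len(set(texts)),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "top_patterns": Counter(value_pattern(t) for t in texts).most_common(3),
    }


# Hypothetical sample: a column everyone "knows" holds five-digit ZIP codes.
zip_codes = ["94063", "94063-1234", "9406", None, "N/A", " ", "10001"]
profile = profile_column(zip_codes)
for name, result in profile.items():
    print(f"{name}: {result}")
```

Even on seven made-up rows, the profile shows missing values, a placeholder string, and more than one format in a column that was expected to be uniform.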

 

Before we examine the reality of the data, be warned that our ideas about data usually conform to some platonic ideal and that profiling shatters this ideal.

 

I’ve justified many a project budget by showing management how lousy the data is and how that poor data is affecting the business.

 

So if everyone profiles, what is the problem?

 

As with anything, there are effective and efficient ways to get a job done, and there are other ways. When I’ve polled audiences during my talks about “how many people use data profiling or data preparation tools,” I've been shocked by the sub-10% response rates I get. Most people are using SQL. For each table or file, they run a bunch of similar queries to get a feel for the critical data. They might save the results somewhere, but usually they do not.
 

Using SQL for profiling is like using tweezers to cut the grass: you’re not going to finish the job, and you won’t cut the grass well either.
 

But running lots of similar, pattern-based SQL queries in parallel is precisely where a computer can help our feeble human brains.
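As a rough illustration of letting the computer do the repetitive part, the sketch below generates the same handful of profiling queries for every column of a table instead of typing them by hand. It assumes an in-memory SQLite database via Python's standard `sqlite3` module; the `profile_table` helper and the `customers` table with its rows are hypothetical, and a real profiling tool does far more than this.

```python
import sqlite3


def profile_table(conn, table):
    """Run one identical summary query per column and collect the results."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    results = {}
    for col in columns:
        query = (
            f'SELECT COUNT(*), COUNT("{col}"), COUNT(DISTINCT "{col}"), '
            f'MIN("{col}"), MAX("{col}") FROM "{table}"'
        )
        total, non_null, distinct, lowest, highest = conn.execute(query).fetchone()
        results[col] = {
            "rows": total,
            "nulls": total - non_null,
            "distinct": distinct,
            "min": lowest,
            "max": highest,
        }
    return results


# Hypothetical sample data for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "CA"), (2, "ca"), (3, None), (4, "CA")],
)

report = profile_table(conn, "customers")
for column, stats in report.items():
    print(column, stats)
```

The table and column names are interpolated into the SQL text, so this sketch is only suitable for names you control. Keeping the results in a dictionary also makes it easy to save them, which is the step the hand-written approach usually skips.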

 

Profiling tools sprouted up to assist Y2K initiatives in the late 1990s. After 2000, they needed another market, so they targeted Data Warehousing as the next beach to storm. I expected that, by now, more people would have realized that data impacts all information projects and that data profiling techniques can be useful everywhere. But alas, this realization is not pervasive.


As we moved towards 2010 and I was working in R&D at Informatica, I pushed hard to make data profiling ubiquitous for both business analysts and IT folks. Data Profiling is something everyone does at some point or other, and numerous stories have been written about the amount of time data scientists spend “preparing the data rather than analyzing it.”

 

Doesn’t this sound similar to the old Data Warehouse adage, “ETL is 70% of the work in a data warehouse project”? ETL was an old-fashioned word for data preparation. Therefore, I said, we should build a profiling interface in the rich client for our developers, but also build a thin-client interface in the browser for our business brethren, because they need profiling too.


Progress continued. After this foundation was laid, a more intelligent approach to profiling arose: "Why can’t profiling tools infer data domains and rules from past learnings?" Up sprang the Data Preparation tools market, including tools from Trifacta, Informatica (within the Enterprise Data Lake), and others.

 



Still, even today only a fraction of projects and teams are incorporating Data Profiling and Data Preparation into their workflows. How many data management projects fall behind because they run into data problems discovered late in the project? Find these problems early, so teams can adapt! 

 

Whether it’s DIY in the home or data in the enterprise, good tools cost money. If you’re a professional with large-scale data issues, take another look at your tools and make sure you have what it takes to tackle your challenges.

 




