Do data scientists really spend 80% of their time wrangling data? Last time around, we examined this notion. But when it comes to data management, how can machine learning change data platforms for the better?


In my last piece, I asked: do data scientists really spend 80% of their time wrangling data? Now it’s time for the follow-up: can machine learning make a difference in data management? Can it alter that 80/20 data cleansing ratio?

Machine Learning (ML) is a term that can mean just about anything. In evaluating a (proprietary) tool for data management that claims to use machine learning, you should understand what that means. It isn’t necessary to see the math or even the code that implements the algorithm.

It should suffice to understand what the algorithm evaluates, along with at least a high-level explanation of how it operates and what it produces.

Keep in mind that the fundamental workings of the algorithms are usually proprietary, so the explanations, if given, will be pretty high level. How coherent the explanation is, though, should help you understand what is real. 

Despite its lofty name, machine learning isn’t that mysterious. The most popular algorithms in use today are pretty mature. What makes them “machine learning” instead of just statistical models is the use of massive amounts of data, which was not previously possible. Some machine learning algorithms in common use are:

  • Linear Regression 
  • Logistic Regression 
  • Linear Discriminant Analysis 
  • Classification and Regression Trees 
  • Naive Bayes 
  • K-Nearest Neighbors 
  • Learning Vector Quantization 
  • Support Vector Machines 
  • Bagging and Random Forest 
  • Boosting and AdaBoost 

Is Machine Learning Artificial Intelligence? 

There is a tendency to conflate machine learning with Artificial Intelligence (AI). There are two general fields of AI. The first, Artificial General Intelligence (AGI), is about machines having human-like cognition and intelligence, but there is some disagreement about when, or if, we will reach that threshold. Each bold new advance that appears to be AGI ends up demonstrating that what was assumed to be intelligence is not; facial recognition is a good example. The second is what is in place now: non-sentient machine intelligence, typically focused on a narrow task. This is where machine learning and AI get mixed up.

For an ML algorithm to learn, it sifts through lots of data using a variety of statistical, non-parametric and other quantitative algorithms to find relationships, patterns and connections in the data. According to Judea Pearl, the Turing Award winner and author of “The Book of Why: The New Science of Cause and Effect,” ML cannot understand cause and effect. ML without causal capabilities, as Pearl derisively claims, “is just curve fitting.” 

Pearl has led the field on the issue of cause and effect, and while there is some truth in his comment, there are many useful applications for ML that are “just curve fitting.” One example is sifting through billions of records to find what relates to what, and how strongly, and then giving analysts or data stewards the opportunity to edit those findings. That’s how ML actually “learns.”

For a data discovery/relationship discovery process to tie to a data catalog, the essential abilities are: 

  • The ability to scale, as data volumes are large and processing is continuous.
  • The ML algorithms operate in supervised and unsupervised mode.
  • No ML discovery algorithm is perfect. User input is captured and cycled back into the ML process.
  • Continuous relearning and adapting of the models. 

I wrote a few years ago: the real magic in applying machine learning models to a software product is producing the right mix of things that are general enough to work with a wide range of situations and powerful enough to produce non-trivial results repeatedly. (A useless example: “Most auto injury accidents occur when the driver is at least 16 years old.”) Supporting data science with integrated (no-code) tools requires creating and maintaining a comprehensive data catalog, but a few steps precede it.

Relationship discovery

If you think about it, the most crucial part of managing collections of unalike data is finding relationships. Finding relationships between so many forms of data is practically impossible to do by hand. When dealing with tabular/columnar data, one approach is to figure out which column names are likely to point to similar kinds of data, though names are not consistently accurate. Instead, the magic lies in investigating the actual data to determine what it is.
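To make the idea of looking at the data rather than the names concrete, here is a minimal sketch. It is not any vendor's algorithm: it assumes a simple Jaccard overlap of distinct values as the similarity measure, and the table names, column names, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of distinct values that two columns have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_relationships(tables, threshold=0.5):
    """Compare every pair of columns across tables by values, not names."""
    columns = [
        (table, column, values)
        for table, cols in tables.items()
        for column, values in cols.items()
    ]
    found = []
    for (t1, c1, v1), (t2, c2, v2) in combinations(columns, 2):
        if t1 == t2:
            continue  # only look for links between different tables
        score = jaccard(v1, v2)
        if score >= threshold:
            found.append((f"{t1}.{c1}", f"{t2}.{c2}", round(score, 2)))
    return sorted(found, key=lambda r: -r[2])


# Hypothetical sample data: the column names differ, but the values overlap.
tables = {
    "orders": {"cust_ref": [101, 102, 103, 104], "amount": [20, 35, 20, 50]},
    "crm": {"client_id": [101, 102, 103, 105], "region": ["N", "S", "N", "E"]},
}
links = candidate_relationships(tables)
print(links)
```

Here `cust_ref` and `client_id` share three of five distinct values, so the pair is flagged even though the names have nothing in common. Note that the loop compares every pair of columns, which is exactly the cost that explodes at scale.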

To put this in perspective, if you have a few billion instances to compare, this can be a computationally expensive (read: slow) process. Here is the first example of machine learning boosting the process. Using some of the algorithms mentioned above, an unsupervised machine learning model can quickly break down the similarities and converge on a solution. As the process flows through the data collection, it builds a relationship map that drives all of the elements of the system. Some powerful techniques that data discovery vendors are employing to find these relationships are:

  • Recurrent Convolutional Neural Networks (RCNN).
  • Semi-Structured Data Parsing: Hidden Markov Model and Gene Sequencing algorithms.

Recommendations are then provided to help the analyst join data sets, enrich the data, choose columns, add filters, and aggregate the data. The algorithms convert the mapping recommendation problem into a machine translation problem using:

  • An encoder-decoder architecture for primitive one-to-one mappings.
  • Maximal grouping, applied as the next step.
  • An attention neural network to resolve the recommendation.

Data flow 

Machine learning can also discover how data flows between databases and data sources and, ultimately, how data moves through the organization: where data originates and what affinities exist in the data itself.
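Once flows have been discovered, they are naturally represented as a directed graph. The sketch below assumes the discovery step has already produced source-to-target pairs (the hard-coded `edges` and every data set name are hypothetical) and only shows how such a map could answer the question of where data from one source ends up.

```python
from collections import defaultdict, deque


def build_flow_graph(edges):
    """Turn (source, target) pairs into an adjacency map."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
    return graph


def downstream(graph, start):
    """Breadth-first walk returning every data set reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical flows, standing in for the output of a discovery process.
edges = [
    ("crm_db", "staging"),
    ("erp_db", "staging"),
    ("staging", "warehouse"),
    ("warehouse", "sales_dashboard"),
]
graph = build_flow_graph(edges)
reach = downstream(graph, "crm_db")
print(sorted(reach))
```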

Sensitive data discovery

There are two types of sensitive data in sources. The first is the obvious personal information such as name, social security number, date of birth, and demographic, sociographic, and psychographic data. The problem is that this data may not be identifiable by merely looking at the column names or other available metadata. Only by examining the data itself can an algorithm decide whether the data falls within the “sensitive” realm.
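As a rough illustration of classifying a column by its contents rather than its name, consider the sketch below. It is deliberately simplistic: it assumes two hand-written regular expressions and an arbitrary 80% match threshold, where a real product would rely on trained models and many more signals. The column names and values are made up.

```python
import re

# Illustrative patterns only; real detectors need far more than regular expressions.
PATTERNS = {
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def classify_column(values, min_share=0.8):
    """Return the pattern labels matched by at least min_share of non-empty values."""
    values = [str(v).strip() for v in values if v not in (None, "")]
    if not values:
        return []
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= min_share:
            labels.append(label)
    return labels


# Hypothetical columns whose names give no hint about what they hold.
columns = {
    "field_17": ["123-45-6789", "987-65-4321", "", "555-12-3456"],
    "notes": ["called customer", "left voicemail", "n/a"],
    "dt": ["1984-02-11", "1990-07-30", "2001-12-01"],
}
flagged = {name: classify_column(vals) for name, vals in columns.items()}
print(flagged)
```

The column called `field_17` is flagged as holding social security numbers purely from its values. A date pattern alone cannot tell a date of birth from an order date, which is one reason pattern matching is only a starting point.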

But there is a deeper problem. Personally Identifiable Information (PII) also includes seemingly non-sensitive information that can be combined with other non-sensitive details to create an “emergent” identity. Additionally, information that company policy defines as sensitive or confidential to the organization may also fall within the realm of “sensitive.”
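The “emergent” identity problem can be shown with a few lines of arithmetic. This sketch assumes a simple measure, the share of records that become unique when fields are combined, and uses invented records; it is meant only to show why fields that look harmless alone can identify people together.

```python
from collections import Counter


def unique_share(rows, fields):
    """Fraction of rows whose combination of values in fields is unique."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Hypothetical records: no single field singles anyone out.
rows = [
    {"zip": "10001", "birth_year": 1980, "gender": "F"},
    {"zip": "10001", "birth_year": 1980, "gender": "M"},
    {"zip": "10001", "birth_year": 1975, "gender": "F"},
    {"zip": "10002", "birth_year": 1980, "gender": "F"},
    {"zip": "10002", "birth_year": 1975, "gender": "M"},
    {"zip": "10002", "birth_year": 1975, "gender": "M"},
]
single = unique_share(rows, ["zip"])
combined = unique_share(rows, ["zip", "birth_year", "gender"])
print(single, combined)
```

With these records, no row is unique on zip code alone, yet four of the six rows are unique once zip code, birth year, and gender are combined.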

Given these types of sensitive data, there are several reasons why managing the process is essential. First, of course, are regulatory requirements, such as the General Data Protection Regulation (GDPR). But there are also organizational promises to customers and suppliers to be a good steward of the data you collect about them. It is relatively easy to enforce these policies when a single internal system generates and manages the data. Still, if the data is scattered across sources and locations, gaps in governance and even the “emergent” problem can occur.

And finally, the gap between policy and digital processing is wide. Policy is stated in natural language, but implementing that policy in software can be pretty tricky.

Impact analysis

Like a trend analysis, this captures changes in the source data at different points in time. For example, if new sensitive data is introduced into the database, impact analysis can determine when that occurred and quantify the delta.
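In its simplest form, this amounts to comparing the results of two scans taken at different times. The sketch below assumes each scan is stored as a mapping from column name to a set of sensitivity labels; that storage format, the dates, and the column names are all hypothetical.

```python
def scan_delta(earlier, later):
    """Compare two scans that map column name -> set of sensitivity labels."""
    added = {c: later[c] - earlier.get(c, set()) for c in later}
    removed = {c: earlier[c] - later.get(c, set()) for c in earlier}
    return {
        "added": {c: labels for c, labels in added.items() if labels},
        "removed": {c: labels for c, labels in removed.items() if labels},
    }


# Hypothetical snapshots keyed by scan date.
snapshots = {
    "2022-01-01": {
        "customers.email": {"email"},
        "customers.notes": set(),
    },
    "2022-02-01": {
        "customers.email": {"email"},
        "customers.notes": {"us_ssn"},
        "orders.card": {"card_number"},
    },
}
delta = scan_delta(snapshots["2022-01-01"], snapshots["2022-02-01"])
new_label_count = sum(len(labels) for labels in delta["added"].values())
print(delta["added"], new_label_count)
```

The delta says that sensitive data appeared in two columns somewhere between the two scan dates; how precisely the “when” can be pinned down depends on how often the scans run.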

Redundant data analysis

Redundant data may, and usually does, have different modification cycles, leading to data confusion. Generally, there aren’t redundant data sources of primary enterprise data (though it happens). Still, other data sets can creep into the universe of sources, such as saved analysis outputs, training data sets and even spreadsheets. The relationship map can identify these redundant sources and allow the analysts to choose the appropriate one. 

Organizations can accumulate vast quantities of redundant data. They may incur unnecessary storage costs and unknowingly leave such data unmanaged and unprotected. Once identified, redundant data also needs to be managed so that organizations can decide on the appropriate remediation steps as part of the data management process.
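For exact copies, redundancy can be spotted with nothing more than a content fingerprint. The sketch below assumes each data set fits in memory as a list of row tuples and hashes the rows in an order-independent way; the data set names are invented. Near-duplicates, such as a copy with a few edited rows, would not be caught and would need the kind of similarity techniques discussed under relationship discovery.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows):
    """Order-independent SHA-256 hash of a data set's rows."""
    digest = hashlib.sha256()
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def redundant_groups(datasets):
    """Group data set names that share an identical fingerprint."""
    groups = defaultdict(list)
    for name, rows in datasets.items():
        groups[fingerprint(rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical sources: a saved export holds the same rows in a different order.
datasets = {
    "warehouse.sales": [(1, "A", 10.0), (2, "B", 12.5)],
    "analyst_export.csv": [(2, "B", 12.5), (1, "A", 10.0)],
    "training_set": [(1, "A", 10.0), (3, "C", 7.0)],
}
dupes = redundant_groups(datasets)
print(dupes)
```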

Data catalog

Most important of all is the data catalog. The automated data catalog is driven by relationship discovery. The whole point of a semantically rich data catalog is to give analysts, data scientists, and business and technology users (anyone who uses data, actually) a means to find the data they need and to understand what it means, how it relates to other data, and how it flows. It also supports collaboration and enables good data governance, data management and, ultimately, business analytics. Unlike the proprietary metadata of enterprise applications such as ERP or CRM, or of Business Intelligence and visualization tools, the catalog is not tied to a specific schema or model. Its generality is the key to its usefulness.

The most common repositories of metadata relate to customer and product domains. There is no doubt that these repositories are useful, but they lack perhaps 90 percent of the valuable data for analytics and data science. 

My take

Machine learning alone cannot break through the 80% problem, but it is a necessary element if applied intelligently. A unified platform, from data discovery to data catalog, can vastly reduce the time it takes to do the analytics required for digital transformation.

