5 Data Governance for Big Data Analytics:
Considerations for Data Policies and Processes
It should not come as a surprise that in a big data environment, much like any other environment, the end users might have concerns about the believability of analytical results. This is particularly true when there is limited visibility into the trustworthiness of the data sources. One added challenge is that even if the producers of the data sources are known, the actual derivation of the acquired datasets may still remain opaque. Striving for data trustworthiness has driven the continued development and maturation of processes and tools for data quality assurance, data standardization, and data cleansing. Data quality is generally seen as a mature discipline, particularly when the focus is on evaluating datasets and applying remedial or corrective actions to ensure that the datasets are fit for the purposes for which they were originally intended.
5.1 THE EVOLUTION OF DATA GOVERNANCE
In the past 5 years or so, there have been a number of realizations that have, to some extent, disrupted this perception of “data quality maturity,” namely:
• Correct versus correction: In many environments, tools are used to fix data, not to ensure that the data is valid or correct. What was once considered the cutting edge in identifying and then fixing data errors has, to some extent, given way to process-oriented validation, root cause analysis, and remediation.
• Data repurposing: More organizational stakeholders recognize that datasets created for one functional purpose within the enterprise (such as sales, marketing, accounts payable, or procurement to name a few) are used multiple times in different contexts, particularly for reporting and analysis. The implication is that data quality can no longer be measured in terms of “fitness for purpose,” but instead must be evaluated in terms of “fitness for purposes,” taking all downstream uses and quality requirements into account.
• The need for oversight: This realization, which might be considered a follow-on to the first, is that ensuring the usability of data for all purposes requires more comprehensive oversight. Such oversight should include monitored controls incorporated into the system development life cycle and across the application infrastructure.
These realizations have led to the discipline called data governance. Data governance encompasses the processes for defining corporate data policies, the processes for operationalizing observance of those policies, and the organizational structures, including data governance councils and data stewards, put in place to monitor and, hopefully, ensure compliance with those policies.
Stated simply, the objective of data governance is to institute the right levels of control to achieve three outcomes (a brief illustrative sketch follows the list below):
1. Alert: Identify data issues that might have negative business impact.
2. Triage: Prioritize those issues in relation to their corresponding business value drivers.
3. Remediate: Have data stewards take the proper actions when alerted to the existence of those issues.
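As a purely illustrative sketch of these three outcomes, the following Python fragment wires an alert, triage, and remediate loop together. The DataIssue structure, the impact scores, and the govern function are hypothetical and not drawn from any specific governance product.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DataIssue:
    description: str               # what was observed, e.g., "null customer IDs in the sales feed"
    business_impact: float         # impact score tied to a business value driver
    remediate: Callable[[], None]  # the corrective action a data steward would take

def govern(issues: List[DataIssue]) -> None:
    # 1. Alert: surface every identified issue that might have negative business impact.
    for issue in issues:
        print(f"ALERT: {issue.description}")

    # 2. Triage: prioritize the issues by their corresponding business value drivers.
    prioritized = sorted(issues, key=lambda i: i.business_impact, reverse=True)

    # 3. Remediate: data stewards act on the highest-impact issues first.
    for issue in prioritized:
        issue.remediate()

# Example: a steward is alerted to a missing-values issue and requests a corrected feed.
govern([DataIssue("null customer IDs in the sales feed", 0.9,
                  lambda: print("steward: re-requesting feed from the source system"))])
```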
When focused internally, data governance not only enables a degree of control for data created and shared within an organization, but also empowers the data stewards to take corrective action, either through communication with the original data owners or by direct data intervention (i.e., “correcting bad data”) when necessary.
5.2 BIG DATA AND DATA GOVERNANCE
Naturally, concomitant with the desire for measurably high quality information in a big data environment is the inclination to institute “big data governance.” It is naive, however, to assert that the traditional approaches to data quality can simply be adopted for big data governance. Nor can one assume that, just because vendors, system integrators, and consultants stake their claims over big data by stressing the need for “big data quality,” the same methods and tools can be used to monitor, review, and correct data streaming into a big data platform.
Upon examination, the key characteristics of big data analytics are not universally adaptable to the conventional approaches to data quality and data governance. For example, in a traditional approach to data quality, levels of data usability are measured based on the idea of “data quality dimensions,” such as:
• Accuracy, referring to the degree to which the data values are correct.
• Completeness, which specifies the data elements that must have values.
• Consistency of related data values across different data instances.
• Currency, which looks at the “freshness” of the data and whether the values are up to date or not.
• Uniqueness, which specifies that each real-world item is represented once and only once within the dataset.
These types of measures are generally intended to validate data using defined rules, catch errors when the input does not conform to those rules, and correct recognized errors when the situation allows it. This approach typically targets moderately sized datasets from known sources, containing structured data and governed by a relatively small set of rules. Operational and analytical applications of limited size can integrate data quality controls, alerts, and corrections, and those corrections will reduce the downstream negative impacts.
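To make this rule-based style of validation concrete, the following is a small Python sketch that checks a handful of hypothetical records against completeness, uniqueness, and currency rules. The field names, thresholds, and helper functions are illustrative assumptions, not the interface of any particular data quality tool.

```python
from datetime import datetime, timedelta

# Hypothetical records from a known, structured source.
records = [
    {"customer_id": "C001", "email": "a@example.com", "updated": datetime(2024, 5, 1)},
    {"customer_id": "C002", "email": None,            "updated": datetime(2021, 1, 15)},
    {"customer_id": "C001", "email": "c@example.com", "updated": datetime(2024, 6, 3)},
]

def check_completeness(recs, required_fields):
    """Completeness: the specified data elements must have values."""
    return [r for r in recs for f in required_fields if r.get(f) is None]

def check_uniqueness(recs, key):
    """Uniqueness: each real-world item should be represented once and only once."""
    seen, duplicates = set(), []
    for r in recs:
        if r[key] in seen:
            duplicates.append(r)
        seen.add(r[key])
    return duplicates

def check_currency(recs, field, max_age_days=365):
    """Currency: flag records whose values are no longer fresh."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [r for r in recs if r[field] < cutoff]

print("Incomplete:", check_completeness(records, ["customer_id", "email"]))
print("Duplicates:", check_uniqueness(records, "customer_id"))
print("Stale:", check_currency(records, "updated"))
```

In a conventional environment, checks of this kind sit inside the data integration flow, where the rule set is small enough to maintain and the sources are known well enough to correct errors at their origin.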
5.3 THE DIFFERENCE WITH BIG DATASETS
On the other hand, big datasets neither exhibit these characteristics, nor do they have similar types of business impacts. Big data analytics is generally centered on consuming massive amounts of a combination of structured and unstructured data from both machine-generated and human sources. Much of the analysis is done without considering the business impacts of errors or inconsistencies across the different sources, where the data originated, or how frequently it is acquired.
Big data applications look at many input streams originating from within and outside the organization, some taken from a variety of social networking streams, syndicated data streams, news feeds, preconfigured search filters, public or open-sourced datasets, sensor networks, or other unstructured data streams. Such diverse datasets resist singular approaches to governance.
When the acquired datasets and data streams originate outside the organization, there is little facility for control over the input. The original sources are often so obfuscated that there is little capacity to even know who created the data in the first place, let alone enable any type of oversight over data creation.
Another issue involves the development and execution model for big data applications. Data analysts are prone to developing their own models in their private sandbox environments. In these cases, the developers often bypass traditional IT and data management channels, opening greater possibilities for inconsistencies with sanctioned IT projects. This is further complicated when datasets are tapped into or downloaded directly without IT’s involvement.
Consistency (or the lack thereof) is probably the most difficult issue. When datasets are created internally and a downstream user recognizes a potential error, that issue can be communicated to the originating system’s owners. The owners then have the opportunity to find the root cause of the problems and then correct the processes that led to the errors.
But with big data systems that absorb massive volumes of data, some of which originates externally, there are limited opportunities to engage process owners to influence modifications or corrections to the source. On the other hand, if you opt to “correct” the recognized data error, you are introducing an inconsistency with the original source, which at worst can lead to incorrect conclusions and flawed decision making.
5.4 BIG DATA OVERSIGHT: FIVE KEY CONCEPTS
The conclusion is that the standard approach to data governance in which data policies defined by an internal governance council direct control of the usability of datasets cannot be universally applied to big data applications. And yet there is definitely a need for some type of oversight that can ensure that the datasets are usable and that the analytic results are trustworthy. One way to address the need for data quality and consistency is to leverage the concept of data policies based on the information quality characteristics that are important to the big data project.
This means considering the intended uses of the results of the analyses and how the inability to exercise any kind of control over the original sources of the information production flow can be mitigated by the users on the consumption side. This approach requires a number of key concepts for data practitioners and business process owners to keep in mind (a brief sketch of a consumption-side policy follows the list):
• managing consumer data expectations;
• identifying the critical data quality dimensions;
• monitoring consistency of metadata and reference data as a basis for entity extraction;
• repurposing and reinterpretation of data;
• data enrichment and enhancement when possible.
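One way to picture such consumption-side policies is as a declarative specification that records, for each analytics use, which information quality characteristics matter to its consumers and the thresholds they will accept. The Python sketch below is a hypothetical illustration under those assumptions; the use names, dimensions, and thresholds are invented for the example.

```python
# Hypothetical consumption-side data policies: for each analytics use, record which
# quality characteristics matter to its consumers and the acceptance thresholds they
# have agreed to for acquired datasets. All names and numbers are invented.
data_policies = {
    "customer_churn_model": {
        "completeness": {"fields": ["customer_id", "last_activity"], "min_ratio": 0.95},
        "currency":     {"field": "last_activity", "max_age_days": 30},
    },
    "quarterly_sales_report": {
        "uniqueness":  {"key": "order_id"},
        "consistency": {"reference_data": "product_master"},
    },
}

def meets_expectation(use: str, dimension: str, measured_ratio: float) -> bool:
    """Check one measured quality level against the documented consumer expectation."""
    spec = data_policies.get(use, {}).get(dimension, {})
    return measured_ratio >= spec.get("min_ratio", 0.0)

# Example: 97% completeness satisfies the churn model's documented expectation.
print(meets_expectation("customer_churn_model", "completeness", 0.97))
```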
5.4.1 Managing Consumer Data Expectations
There may be a wide variety of users consuming the results of the spectrum of big data analytics applications, many of which use an intersection of available datasets. Analytics applications are supposed to be designed to provide actionable knowledge to create or improve value. The quality of information must therefore be directly related either to the ways the business processes are expected to be improved by better quality data or to the undesired negative impacts of ignoring data problems. In addition, different parties may have varied levels of interest in asserting levels of usability and acceptability for acquired datasets.
This means, for the scope of the different big data analytics projects, you must ascertain these collective user expectations by engaging the different consumers of big data analytics to discuss their expectations for data usability and quality.