5.4 BIG DATA OVERSIGHT: FIVE KEY CONCEPTS
The conclusion is that the standard approach to data governance, in which data policies defined by an internal governance council direct control of the usability of datasets, cannot be universally applied to big data applications. And yet there is definitely a need for some type of oversight that can ensure that the datasets are usable and that the analytic results are trustworthy. One way to address the need for data quality and consistency is to leverage the concept of data policies based on the information quality characteristics that are important to the big data project.
This means considering the intended uses of the results of the analyses and how the inability to exercise any kind of control over the original sources of the information production flow can be mitigated by the users on the consumption side. This approach requires a number of key concepts for data practitioners and business process owners to keep in mind:
• managing consumer data expectations;
• identifying the critical data quality dimensions;
• monitoring consistency of metadata and reference data as a basis for entity extraction;
• repurposing and reinterpretation of data;
• data enrichment and enhancement when possible.
5.4.1 Managing Consumer Data Expectations
There may be a wide variety of users consuming the results of the spectrum of big data analytics applications, and many of these applications use an intersection of available datasets. Analytics applications are supposed to be designed to provide actionable knowledge that creates or improves value. The quality of information must be directly related to the ways the business processes are either expected to be improved by better quality data or exposed to undesired negative impacts when data problems are ignored, and different parties may have varied levels of interest in asserting levels of usability and acceptability for acquired datasets.
This means that, for the scope of the different big data analytics projects, you must ascertain these collective user expectations by engaging the different consumers of big data analytics to discuss how quality aspects of the input data might affect the computed results. Some examples include:
• datasets that are out of sync from a time perspective (e.g., one dataset refers to today’s transactions being compared to pricing data from yesterday);
• not having all the datasets available that are necessary to execute the analysis;
• not knowing if the data element values that feed the algorithms, taken from different datasets, share the same precision (e.g., sales per minute vs. sales per hour);
• not knowing if the values assigned to similarly named data attributes truly share the same underlying meaning (e.g., is a “customer” the person who pays for our products or the person who is entitled to customer support?).
Engaging the consumers for requirements is a process of discussions with the known end users, coupled with some degree of speculation and anticipation about who the pool of potential end users might be, what they might want to do with a dataset, and, correspondingly, what their levels of expectation are. Then it is important to establish how those expectations can be measured and monitored, as well as the realistic remedial actions that can be taken.
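As a minimal sketch of how a couple of these expectation checks might be expressed, assume each acquired dataset carries a small amount of declared metadata (an as-of date and a measurement grain); the dataset names, fields, and values below are hypothetical.

from datetime import date

# Hypothetical metadata for two acquired datasets; names and fields are illustrative.
transactions = {"as_of": date(2024, 5, 2), "grain": "per_minute"}
pricing = {"as_of": date(2024, 5, 1), "grain": "per_hour"}

def check_expectations(left, right):
    # Flag the kinds of mismatches that consumers should be asked about.
    issues = []
    if left["as_of"] != right["as_of"]:
        issues.append(f"datasets out of sync: {left['as_of']} vs {right['as_of']}")
    if left["grain"] != right["grain"]:
        issues.append(f"measurement precision differs: {left['grain']} vs {right['grain']}")
    return issues

for issue in check_expectations(transactions, pricing):
    print("expectation at risk:", issue)

In practice, the specific checks would be derived from the expectations gathered from the consumers themselves rather than fixed in advance.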
5.4.2 Identifying the Critical Dimensions of Data Quality
An important step is to determine the dimensions of data quality that are relevant to the business and then distinguish those that are only measurable from those that are both measurable and controllable. This distinction is important, since you can use the measures to assess usability when you cannot exert control and to make corrections or updates when you do have control. In either case, here are some dimensions for measuring the quality of information used for big data analytics (a brief sketch following the list suggests how a few of them might be computed):
• Temporal consistency: Measuring the timing characteristics of datasets used in big data analytics to see whether they are aligned from a temporal perspective.
• Timeliness: Measuring if the data streams are delivered according to end-consumer expectations.
• Currency: Measuring whether the datasets are up to date.
• Completeness: Measuring whether all the data needed for the analysis is available.
• Precision consistency: Assessing whether the units of measure associated with each data source share the same precision, and whether those units are properly harmonized when they do not.
• Unique identifiability: Focusing on the ability to uniquely identify entities within datasets and data streams and link those entities to known system of record information.
• Semantic consistency: This metadata activity may incorporate a glossary of business terms, hierarchies and taxonomies for business concepts, and relationships across concept taxonomies for standardizing ways that entities identified in structured and unstructured data are tagged in preparation for data use.
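As a minimal, hypothetical sketch of how a few of these dimensions might be turned into measures, assume a small batch of records acquired from a stream; the field names, timestamps, and thresholds are illustrative assumptions rather than prescribed metrics.

from datetime import datetime, timedelta, timezone

# Hypothetical records from an acquired stream; field names are illustrative.
records = [
    {"id": "A1", "customer": "Acme", "loaded_at": datetime(2024, 5, 2, 9, 30, tzinfo=timezone.utc)},
    {"id": "A2", "customer": None, "loaded_at": datetime(2024, 5, 2, 9, 45, tzinfo=timezone.utc)},
]
now = datetime(2024, 5, 2, 10, 0, tzinfo=timezone.utc)  # fixed "now" so the example is reproducible

def completeness(recs, field):
    # Share of records with a populated value for the given field.
    return sum(r[field] is not None for r in recs) / len(recs)

def currency(recs, max_age):
    # Share of records loaded within an acceptable age window.
    return sum((now - r["loaded_at"]) <= max_age for r in recs) / len(recs)

print("customer completeness:", completeness(records, "customer"))       # 0.5
print("currency within 1 hour:", currency(records, timedelta(hours=1)))  # 1.0

Timeliness, temporal consistency, and the other dimensions can be scored in the same spirit once the corresponding consumer expectations are stated explicitly.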
5.4.3 Consistency of Metadata and Reference Data for Entity Extraction
Big data analytics is often closely coupled with the concept of text analytics, which depends on contextual semantic analysis of streaming text and consequent entity concept identification and extraction. But before you can aspire to this kind of analysis, you need to ground your definitions within clear semantics for commonly used reference data and units of measure, as well as identify the aliases used to refer to the same or similar ideas.
Analyzing relationships and connectivity in text data is key to entity identification in unstructured text. But because of the variety of types of data that span both structured and unstructured sources, one must be aware of the degree to which unstructured text is replete with nuances, variation, and double meanings. There are many examples of this ambiguity, such as references to a car, a minivan, an SUV, a truck, a roadster, as well as the manufacturer’s company name, make, or model—all referring to an automobile.
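As a small, assumed illustration of taming this kind of variation with reference data, the sketch below maps the aliases from the automobile example to a canonical concept before tagging; the alias table and the whitespace tokenizer are deliberate simplifications, not a production entity extractor.

# Hypothetical reference data: aliases mapped to a canonical concept.
vehicle_aliases = {
    "car": "automobile", "minivan": "automobile", "suv": "automobile",
    "truck": "automobile", "roadster": "automobile",
}

def extract_concepts(text, alias_map):
    # Tag tokens that match known reference-data aliases with their canonical concept.
    tags = []
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'")
        if word in alias_map:
            tags.append((word, alias_map[word]))
    return tags

print(extract_concepts("The customer traded in a minivan for an SUV.", vehicle_aliases))
# [('minivan', 'automobile'), ('suv', 'automobile')]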
These concepts are embedded in the data values within a context, and are manifested as metadata tags, keywords, and categories that are often recognized as the terms that drive how search engine optimization algorithms associate concepts with content. Entity identification and extraction depend on differentiating words and phrases that carry high levels of “meaning” (such as person names, business names, locations, or quantities) from those that are used to establish connections and relationships, mostly embedded within the language of the text.
As data volumes expand, there must be some process for definition (and therefore control) over concept variation in source data streams. Introducing conceptual domains and hierarchies can help with semantic consistency, especially when comparing data coming from multiple source data streams.
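One way to picture such conceptual domains and hierarchies is a simple parent map that lets terms drawn from different source streams be compared at a common level; this is a hedged sketch with made-up concepts, not a prescribed taxonomy structure.

# Hypothetical concept hierarchy: each concept points to its parent domain.
hierarchy = {
    "roadster": "automobile",
    "suv": "automobile",
    "automobile": "vehicle",
    "motorcycle": "vehicle",
}

def roll_up(concept, parents):
    # Walk a concept up to its top-level domain for consistent comparison.
    while concept in parents:
        concept = parents[concept]
    return concept

# Terms from two different source streams agree once rolled up to the domain level.
print(roll_up("roadster", hierarchy), roll_up("motorcycle", hierarchy))  # vehicle vehicle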
Be aware that context carries meaning; there are different inferences about data concepts and relationships that you can make based on the identification of concept entities known within your reference data domains and how close together they are found in the data source or stream. But since the same terms and phrases may have different meanings depending on the participating constituency generating the content, this yet again highlights the need for precision in the semantics associated with concepts extracted from data sources and streams.
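As an assumed, simplified illustration of how closeness to known reference concepts might drive this kind of inference, the sketch below disambiguates a term by the cue words that co-occur with it (co-occurrence stands in for proximity, and the sense inventory is invented).

# Hypothetical sense inventory: each sense of an ambiguous term has cue words.
senses = {
    "jaguar": {
        "automobile": {"dealer", "engine", "sedan"},
        "animal": {"jungle", "predator", "habitat"},
    },
}

def disambiguate(term, context_words, sense_map):
    # Score each sense by how many of its cue words occur near the term.
    scores = {sense: len(cues & context_words) for sense, cues in sense_map[term].items()}
    return max(scores, key=scores.get)

context = {"the", "dealer", "listed", "a", "used", "jaguar", "sedan"}
print(disambiguate("jaguar", context, senses))  # automobile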
