The way that we collect, store and access information has fundamentally changed, and with it, the way that data must be governed.
The majority of our information is no longer found under lock and key in corporate filing cabinets; instead, it resides in cloud, hybrid and multi-cloud environments.
‘Static data’ has been replaced with ever-expanding buckets of unstructured digital documents, voice and video data, serving not only the transactional side of business operations but also satisfying the constant search for new insights to drive competitive advantage.
Privacy regulations around the world have also transformed the way in which sensitive data must be identified, classified and handled, particularly when it is moved across borders.
And as workforces enjoy more and more data insights in their personal lives, whether tracking their fitness, journeys or household energy performance, it follows that the demand for self-service analytics in the workplace is also on a steep upward curve.
So we understand some of the new demands, but how can they be supported?
Agility by design is the answer.
A modern data and analytics architecture typically has a component-based design, with specialist technologies linked through plug-and-play, each performing a dedicated role such as ingestion, storage, access entitlement or BI dashboarding.
Similarly, the governance of the data itself is best supported by developing flexible capabilities – a combination of people, process and technology which can adapt and refocus with changing demand.
Traditional data management capabilities remain as important as ever – defining roles and accountable individuals, enforcing appropriate data policies and standards, and measuring and reporting on data quality (not just accuracy, but related dimensions such as timeliness, uniqueness and conformity to standards).
But these are just the table stakes for a modern, data-led organisation, and much more is needed alongside them. So what are these additional capabilities?
The brain of a well-orchestrated Data and Analytics platform is a Data Catalog. This is where knowledge about the data is mastered and distributed – definitions, classifications, recording of locations, owners and stewards, master sources, critical elements, consumers, mapping between technical fields and logical business terms, even consumer sentiment about the usefulness of the data.
The controls and processes which support the Data Catalog are critical. A comprehensive and tightly managed Data Catalog can offer extensive automation opportunities for downstream applications.
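As a minimal sketch of the kind of knowledge a catalog masters, the entry below captures the elements described above as a simple data model. All field names and values are illustrative assumptions, not any vendor's schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative catalog record for one dataset (all names are assumptions)."""
    name: str
    definition: str
    classification: str                                  # e.g. "public", "internal", "pii"
    locations: list[str] = field(default_factory=list)   # where the data physically resides
    owner: str = ""
    stewards: list[str] = field(default_factory=list)
    master_source: bool = False                          # is this the golden source?
    critical_elements: list[str] = field(default_factory=list)
    consumers: list[str] = field(default_factory=list)
    # mapping from technical field name to logical business term
    business_terms: dict[str, str] = field(default_factory=dict)
    sentiment_scores: list[int] = field(default_factory=list)  # consumer ratings

entry = CatalogEntry(
    name="customer_master",
    definition="Golden record of active customers",
    classification="pii",
    owner="cdo-office",
    master_source=True,
    business_terms={"cust_nm": "Customer Name", "dob": "Date of Birth"},
)
```

Holding this metadata in one mastered place is what makes the downstream automation described next possible.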
Classifications of sensitive data can drive automated data access entitlements and row- or column-level data obfuscation and masking, delivering this consistently for common data elements across multiple consuming systems.
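A minimal sketch of this idea, assuming column classifications are sourced from the catalog: one masking rule is applied uniformly, so every consuming system obfuscates the same elements the same way (the classification labels are illustrative):

```python
# Classification labels that trigger masking (illustrative vocabulary).
SENSITIVE = {"pii", "confidential"}

def mask_row(row: dict, classifications: dict) -> dict:
    """Return a copy of the row with sensitive columns obfuscated.

    `classifications` maps column name -> classification label, as would be
    distributed from a central Data Catalog.
    """
    return {
        col: "***" if classifications.get(col) in SENSITIVE else value
        for col, value in row.items()
    }

classifications = {"name": "pii", "country": "public", "salary": "confidential"}
row = {"name": "Jane Doe", "country": "UK", "salary": 55000}
print(mask_row(row, classifications))  # {'name': '***', 'country': 'UK', 'salary': '***'}
```

Because the rule keys off the catalog's classification rather than per-system configuration, a reclassification in one place changes behaviour everywhere.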
Data tagging can also be used to highlight and enforce data usage restrictions, such as licensing terms on vendor data which cannot be widely distributed.
Another area where organisations increasingly require mature capability is data lifecycle management: the ability to classify and manage data from creation or ingestion through to archiving and, ultimately, destruction once legal data retention terms have elapsed.
Knowing where data is stored and consumed is critical, for all data and across the full data lifecycle, both to support privacy laws and to identify cross-border data transfers.
The ability to search for data using logical business terms, through a marketplace or directly in the catalog, allows data scientists and business users to identify useful data for self-serve analytics, returning their own sentiment scores on the data they have used or tagging newly created derived datasets for the benefit of others.
Information about data consumption is itself now highly relevant. The intended purpose of data use is an increasingly important consideration, especially when provisioning access to sensitive datasets. One way this can be managed is via data sharing agreements, which outline the distributor, consumer, jurisdictions, intended use and the level or period of access granted.
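Such an agreement lends itself to being recorded as structured data that access provisioning can check automatically. A minimal sketch, with all field names and the permit logic as illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataSharingAgreement:
    """Illustrative record of a data sharing agreement (field names are assumptions)."""
    distributor: str
    consumer: str
    jurisdictions: list          # where the data may be held or processed
    intended_use: str
    access_level: str            # e.g. "read-only"
    expires: date

    def permits(self, purpose: str, jurisdiction: str, on: date) -> bool:
        """Check a requested use against the agreement's terms."""
        return (
            purpose == self.intended_use
            and jurisdiction in self.jurisdictions
            and on <= self.expires
        )

dsa = DataSharingAgreement(
    distributor="vendor-x", consumer="analytics-team",
    jurisdictions=["UK", "EU"], intended_use="credit-risk-modelling",
    access_level="read-only", expires=date(2025, 12, 31),
)
```

Encoding the agreement this way means a request for a different purpose, jurisdiction or date can be declined automatically rather than by manual review.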
With the rapid expansion of the use of Large Language Models (LLMs) such as ChatGPT, it has never been more important to consider and track ethical considerations around data usage. What data should either business users or AI engines have access to query?
It is clear that organisations capture huge amounts of data about their staff and customers – working locations, preferences, social interactions, health information, family members, financial matters, security screening and background check results. As a minimum, a data ethics policy is needed. Today there is invariably a human in the loop, initiating lab-based AI activity and controlling what LLMs can consume. As AI transitions from lab-based pilots to mainstream processing, ethical tags will need to be embedded alongside the data.
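One way such embedded ethical tags could work is as a gate on what an AI engine may query. The tag vocabulary and dataset names below are illustrative assumptions:

```python
# Illustrative catalog fragment: each dataset carries an ethics tag
# (vocabulary is an assumption, not an established standard).
datasets = {
    "sales_summary": {"ethics": "open"},
    "staff_health_records": {"ethics": "no-ai"},
    "customer_profiles": {"ethics": "human-review"},
}

def allowed_for_llm(name: str) -> bool:
    """Only datasets explicitly tagged as open may be consumed by an LLM."""
    return datasets.get(name, {}).get("ethics") == "open"

print([d for d in datasets if allowed_for_llm(d)])  # ['sales_summary']
```

The default-deny stance matters: an untagged or unknown dataset is withheld until a human has classified it.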
Evolving standards like the Enterprise Data Management Council’s Cloud Data Management Capabilities (CDMC) framework are helping both cloud data providers and the firms who consume their services to think about some of these evolving data governance needs and to benchmark their own cloud data maturity and controls against peers.
No two organisations have the same governance roadmap, as each has a different starting point and its own business priorities and pressures, but it is clear that good data governance underpins control, operating efficiency and business innovation. It has never been more critical to find order in the chaos.
Mark Davies is a London-based Partner with Element22, a boutique data and analytics consultancy.