One Big Cluster Stuck: Data Asset Standardization

By neub9

Data asset standardization is the purposeful and carefully planned consolidation of redundant, contradictory reports, processes, and databases into enterprise standards. The proliferation of data assets can have the greatest adverse impact on environmental health; standardization has many health benefits:

  • Reduces the likelihood that ill-constructed assets take down processes, nodes, and clusters
  • Reduces contention and competition for compute and storage
  • Reduces process and service failures and associated troubleshooting effort
  • Reduces effort spent maintaining and supporting redundant assets

Although data asset standardization can improve environmental health more than any other category in this series, its business value is greater still: standard data definitions, improved data governance, consistent data interpretation, greater data trustworthiness, and improved data-driven decision making. Ideally you are realizing these benefits using Cloudera Data Catalog.

Total data standardization is a multiyear journey and likely unnecessary, but the low-hanging fruit is ripe for the picking. We strongly recommend embarking on this journey and continuing until returns diminish.

Report Standardization

Take these steps:

  1. Inventory reports, including ownership, usage statistics, and report frequency.
  2. Target for retirement any reports unused in the last year, then in the last 6 months. Pay particular attention to report frequency, as low usage of an annual report may be appropriate.
  3. Select a report archival method commensurate with your customer partnership dynamic (we hope you’re not in data purgatory):
    1. Two weeks before, one week before, and on the day of the archival, notify report owners which reports you intend to archive, allowing them a grace period to object and justify the report’s continued existence.
    2. Alternatively, archive them without notification and restore a report when anyone shouts about it.
  4. Archive targeted reports. In Tableau, we prefer to simply assign report ownership to a system user, which prohibits further use while enabling us to easily restore the report if requested and justified.
  5. Repeat the exercise quarterly. In our experience, 80-90% of reporting inventory can be archived in as little as 2 quarters.
    1. If your visualization tool employs extract jobs, stop them, and note any database archival targets.
  6. Occasionally investigate the appropriateness of report refresh rates and negotiate.
  7. Over time, consolidate additional assets by grafting heavily used report features and functions into enterprise standard dashboards, then retire redundant legacy reports. Admittedly, this is difficult and time-consuming work, usually undertaken as a means to trusted data, not environmental health.
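
As a rough illustration of steps 1 to 3, the sketch below flags retirement candidates from a report inventory and computes the three notification dates. The `Report` fields, the function names, and the rule of stretching the look-back window to two report cycles are assumptions made for this example; they are not part of any reporting tool's API.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Report:
    """Hypothetical row of the report inventory (step 1)."""
    name: str
    owner: str
    last_viewed: Optional[date]  # None means no recorded usage
    refresh_days: int            # how often the report is produced, in days


def retirement_candidates(reports, today, unused_days=365):
    """Return reports with no usage inside the look-back window (step 2).

    Assumption: the window is stretched to at least two report cycles so
    that an annual report opened once a year is not flagged prematurely.
    """
    candidates = []
    for r in reports:
        window = timedelta(days=max(unused_days, 2 * r.refresh_days))
        if r.last_viewed is None or today - r.last_viewed > window:
            candidates.append(r)
    return candidates


def notification_dates(archive_on):
    """Two weeks before, one week before, and the day of archival (step 3)."""
    return [
        archive_on - timedelta(weeks=2),
        archive_on - timedelta(weeks=1),
        archive_on,
    ]


# Made-up inventory for illustration only.
today = date(2024, 7, 1)
inventory = [
    Report("daily_sales", "ana", date(2024, 6, 28), 1),
    Report("old_kpi", "raj", date(2023, 1, 15), 7),
    Report("annual_audit", "li", date(2023, 8, 1), 365),
    Report("never_opened", "sam", None, 30),
]
print([r.name for r in retirement_candidates(inventory, today)])
# → ['old_kpi', 'never_opened']
```

Running the same function a second time with `unused_days=180` would correspond to the "then in the last 6 months" pass.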

DB Standardization

  1. As before, inventory database assets, ownership, refresh frequency, and associated usage statistics.
  2. Target temporary/testing databases and user databases owned by former FTEs.
  3. Communicate far and wide. In most cases, we’re not so brave as to archive databases without notification and permission. We enjoy our jobs and want to keep them.
  4. Archive the databases. We usually archive into a common archival database. In our experience, this can reduce the number of production tables by 35-55%.
  5. Occasionally negotiate refresh rates and data retention policies with database owners.
  6. We strongly recommend taking the multiyear journey to consolidate centralized data assets into enterprise standards as far as possible, as this can significantly improve data trustworthiness and the accuracy of data-driven decision making.
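
The targeting in step 2 can be sketched as a simple filter over the database inventory. The `Database` record, the naming pattern for temporary/testing databases, and the set of active employees are all hypothetical; your own naming conventions and HR source will differ.

```python
import re
from dataclasses import dataclass


@dataclass
class Database:
    """Hypothetical row of the database inventory (step 1)."""
    name: str
    owner: str


# Assumed naming convention: tmp/temp/test/scratch as a whole name segment.
TEMP_PATTERN = re.compile(r"(^|_)(tmp|temp|test|scratch)(_|$)", re.IGNORECASE)


def archival_targets(databases, active_employees):
    """Map each targeted database name to the reasons it was flagged."""
    targets = {}
    for db in databases:
        reasons = []
        if TEMP_PATTERN.search(db.name):
            reasons.append("temporary/testing")
        if db.owner not in active_employees:
            reasons.append("owner no longer employed")
        if reasons:
            targets[db.name] = reasons
    return targets


# Made-up inventory for illustration only.
dbs = [
    Database("sales_prod", "ana"),
    Database("tmp_load", "ana"),
    Database("jsmith_sandbox", "jsmith"),
    Database("latest_orders", "ana"),
]
flagged = archival_targets(dbs, active_employees={"ana"})
print(flagged)
```

The output is only a list of candidates to communicate to owners; per step 3, nothing here archives anything without notification and permission.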

Pipelines and Jobs Standardization

Database asset standardization will identify archival opportunities for (1) the pipeline inventory, here referring to processes which move data from one repository or source to another repository or curated dataset, as well as (2) the jobs inventory, here referring to queries which provide views or persist data within the environment. Standardizing processes is high effort with diminishing returns on environmental health; therefore, begin with processes that:

  • Frequently fail
  • Are most critical
  • Are most frequently updated
  • Are the most resource intensive
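
One way to turn these four criteria into a starting order is a simple score per process. The sketch below is an assumption-laden illustration: the `Process` fields, the 90-day window, and the equal weighting of the criteria are invented for the example and should be tuned to your environment.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Hypothetical row of the pipeline/job inventory."""
    name: str
    failures_90d: int     # failures in the last 90 days
    critical: bool        # flagged as business critical
    updates_90d: int      # code or config changes in the last 90 days
    cpu_hours_90d: float  # compute consumed in the last 90 days


def prioritize(processes):
    """Return process names, highest standardization priority first."""
    if not processes:
        return []

    def scale(values):
        # Normalize each criterion to 0..1 against the largest value seen.
        top = max(values)
        return [v / top if top else 0.0 for v in values]

    fail = scale([p.failures_90d for p in processes])
    upd = scale([p.updates_90d for p in processes])
    cpu = scale([p.cpu_hours_90d for p in processes])

    # Equal weights across the four criteria (an assumption).
    scored = [
        (f + (1.0 if p.critical else 0.0) + u + c, p.name)
        for p, f, u, c in zip(processes, fail, upd, cpu)
    ]
    return [name for score, name in sorted(scored, key=lambda s: -s[0])]


# Made-up inventory for illustration only.
jobs = [
    Process("ad_hoc_export", 1, False, 10, 20.0),
    Process("nightly_load", 12, True, 3, 400.0),
    Process("weekly_rollup", 0, True, 0, 100.0),
]
print(prioritize(jobs))
# → ['nightly_load', 'weekly_rollup', 'ad_hoc_export']
```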

As always, if you need assistance identifying or executing data asset standardization, engage our Professional Services experts. We did!
