HDFS Snapshot Best Practices – Cloudera Blog

Introduction

The Apache Hadoop Distributed File System (HDFS) provides snapshots to capture point-in-time copies of the file system, protecting data against corruption and against user or application errors. This feature is available across Cloudera Data Platform (CDP), Cloudera Distribution for Hadoop (CDH), and Hortonworks Data Platform (HDP). This blog provides insights and techniques to help you optimize the use of snapshots.

Snapshots to Protect Data

Snapshots are efficient for several reasons. Creating a snapshot is instantaneous regardless of the size and depth of the directory subtree, because it only adds a snapshot record to the snapshottable directory and does not create extra copies of blocks on the file system. The feature is also designed so that accessing or modifying the current files and directories in the file system carries no additional overhead.
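
As a minimal sketch of how this works in practice, the Java example below uses the HDFS FileSystem API to mark a directory as snapshottable and then take a snapshot. The /data/projects/sales path and the snapshot name are hypothetical, and cluster settings are assumed to come from the configuration files on the classpath; the same steps are available on the command line as hdfs dfsadmin -allowSnapshot and hdfs dfs -createSnapshot.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class CreateSnapshotExample {
        public static void main(String[] args) throws Exception {
            // Cluster settings (fs.defaultFS, etc.) are assumed to come from
            // core-site.xml / hdfs-site.xml on the classpath.
            Configuration conf = new Configuration();
            Path projectDir = new Path("/data/projects/sales"); // hypothetical project directory

            try (FileSystem fs = FileSystem.get(conf)) {
                // Mark the directory as snapshottable (requires admin privileges);
                // CLI equivalent: hdfs dfsadmin -allowSnapshot /data/projects/sales
                ((DistributedFileSystem) fs).allowSnapshot(projectDir);

                // Take the snapshot: only a snapshot record is added, no data blocks are copied.
                Path snapshot = fs.createSnapshot(projectDir, "s-2024-01-01");
                System.out.println("Created snapshot at " + snapshot);
            }
        }
    }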

However, other snapshot operations do have overheads. In this blog, we look at the time complexity of these operations and highlight best practices to mitigate them.

Understanding Snapshot Operations

Here’s a summary of the time complexity, or overhead, of the different operations on snapshotted files and directories:

  1. Taking a snapshot: O(1) – only a snapshot record is added
  2. Accessing a file/directory in the current state: no additional overhead from snapshots
  3. Modifying a file/directory in the current state: a modification record is added for each input path
  4. Accessing a file/directory in a particular snapshot: O(d * m), where d is the depth of the file/directory and m is the number of modifications
  5. Deleting a snapshot: O(b + n log m), where b is the number of blocks to be collected, n is the number of files/directories, and m is the number of modifications
  6. Computing a snapshot diff: O(n * (m + s)), where n is the number of files/directories, m is the number of modifications, and s is the number of snapshots between the older and the newer snapshot (see the sketch after this list for computing a diff programmatically)
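
To make the diff cost in item 6 concrete, here is a sketch that requests a snapshot diff report between two adjacent snapshots through DistributedFileSystem.getSnapshotDiffReport. The directory and snapshot names reuse the hypothetical examples from earlier; the CLI equivalent is hdfs snapshotDiff.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

    public class SnapshotDiffExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path projectDir = new Path("/data/projects/sales"); // hypothetical snapshottable directory

            try (FileSystem fs = FileSystem.get(conf)) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;

                // Diff two adjacent snapshots: the closer together they are, the smaller
                // m and s in the O(n * (m + s)) cost above.
                SnapshotDiffReport report =
                        dfs.getSnapshotDiffReport(projectDir, "s-2024-01-01", "s-2024-01-02");

                // Each entry describes a created, deleted, renamed, or modified path.
                for (SnapshotDiffReport.DiffReportEntry entry : report.getDiffList()) {
                    System.out.println(entry);
                }
            }
        }
    }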

Best Practices to Avoid Pitfalls

To make the most of HDFS snapshots, consider the following best practices:

  1. Don’t create snapshots at the root directory: create snapshots at the project directories and the user directories instead
  2. Avoid taking very frequent snapshots, and delete snapshots that are no longer needed (see the pruning sketch after this list)
  3. Avoid running snapshot diff when the delta is very large (multiple days/weeks/months of changes, or more than 1 million changes)
  4. Avoid running snapshot diff for snapshots that are far apart (e.g. a diff between two snapshots taken a month apart)
  5. Avoid running snapshot diff at the snapshottable directory itself; scope the diff to the subdirectories where the changes occurred
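
As one way to follow practices 2–4, the sketch below prunes snapshots that fall outside a retention window so that the number of retained snapshots, and the modification records they accumulate, stays small. It assumes the hypothetical directory and the s-YYYY-MM-DD snapshot naming used above, and it reads the existing snapshot names from the read-only .snapshot listing that HDFS exposes under a snapshottable directory; the retention window itself is an example value, not a recommendation.

    import java.time.LocalDate;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PruneSnapshotsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path projectDir = new Path("/data/projects/sales"); // hypothetical snapshottable directory
            LocalDate cutoff = LocalDate.now().minusDays(7);    // hypothetical 7-day retention window

            try (FileSystem fs = FileSystem.get(conf)) {
                // Existing snapshots are listed under the read-only ".snapshot" subdirectory.
                for (FileStatus status : fs.listStatus(new Path(projectDir, ".snapshot"))) {
                    String name = status.getPath().getName();   // e.g. "s-2024-01-01"
                    if (!name.startsWith("s-")) {
                        continue;                               // skip snapshots we did not create
                    }
                    LocalDate taken = LocalDate.parse(name.substring(2));
                    if (taken.isBefore(cutoff)) {
                        // CLI equivalent: hdfs dfs -deleteSnapshot /data/projects/sales <name>
                        fs.deleteSnapshot(projectDir, name);
                        System.out.println("Deleted old snapshot " + name);
                    }
                }
            }
        }
    }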

Adhering to these best practices will maximize the efficiency and effectiveness of snapshots in HDFS.
