The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.
Prior to CDH 4.2, the only ways to back up or clone a table were to use Copy/Export Table or, after disabling the table, to copy all the HFiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table, but it has a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.
In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.
Here are a few of the use cases for HBase snapshots:
A snapshot is a set of metadata that allows an admin to get back to a previous state of the table. A snapshot is not a copy of the table; it’s just a list of file names, and taking one doesn’t copy any data. A full snapshot restore means that you get back the previous “table schema” and your previous data, losing any changes made since the snapshot was taken.
Operations
The main difference between a snapshot and a CopyTable/ExportTable is that the snapshot operations write only metadata. There are no massive data copies involved.
One of the main HBase design principles is that once a file is written it will never be modified. Having immutable files means that a snapshot just keeps track of files used at the moment of the snapshot operation, and during a compaction it is the responsibility of the snapshot to inform the system that the file should not be deleted but instead it should be archived.
The same principle applies to a Clone or Restore operation. Since the files are immutable a new table is created with just “links” to the files referenced by the snapshot.
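To make the bookkeeping concrete, here is a minimal Python sketch of the idea. It is a toy model, not HBase's actual implementation: the `ToyStore` class and all of its method names are hypothetical. It treats data files as immutable names, a snapshot as a saved list of those names, a compaction as an operation that archives (rather than deletes) any replaced file that is still referenced, and a clone as a new table that simply links to the snapshot's files.

```python
class ToyStore:
    """Toy model of snapshot bookkeeping; names are illustrative, not HBase APIs."""

    def __init__(self):
        self.tables = {}      # table name -> list of live (immutable) file names
        self.snapshots = {}   # snapshot name -> tuple of file names
        self.archive = set()  # files dropped by a table but still referenced
        self.deleted = set()  # files that were really removed

    def snapshot(self, table, name):
        # Metadata only: remember which files the table uses right now.
        self.snapshots[name] = tuple(self.tables[table])

    def _referenced(self, file_name):
        in_snapshot = any(file_name in files for files in self.snapshots.values())
        in_table = any(file_name in files for files in self.tables.values())
        return in_snapshot or in_table

    def compact(self, table, merged_file):
        # A compaction replaces the table's files with one newly written file.
        old_files = self.tables[table]
        self.tables[table] = [merged_file]
        for file_name in old_files:
            if self._referenced(file_name):
                self.archive.add(file_name)   # keep it for snapshots/clones
            else:
                self.deleted.add(file_name)

    def clone(self, snapshot, new_table):
        # No data copy: the new table just links to the snapshot's files.
        self.tables[new_table] = list(self.snapshots[snapshot])


if __name__ == "__main__":
    store = ToyStore()
    store.tables["t"] = ["f1", "f2"]
    store.snapshot("t", "s1")
    store.compact("t", "f3")
    store.clone("s1", "t_clone")
    print(sorted(store.archive), store.tables["t_clone"])
```

In this sketch the files `f1` and `f2` survive the compaction only because snapshot `s1` still lists them; without the snapshot they would land in `deleted`.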
Export Snapshot is the only operation that requires a copy of the data, since the other cluster doesn’t have the data files.
Aside from the better consistency guarantees that a snapshot can provide compared to a Copy/Export job, the main difference between exporting a snapshot and copying/exporting a table is that ExportSnapshot operates at the HDFS level. This means that the Master and Region Servers are not involved in these operations. Consequently, no unnecessary data caches are created, and no additional GC pauses are triggered by the number of objects created during the scan process. The performance impact on the HBase cluster stems from the extra network and disk workload experienced by the DataNodes.
Confirm that snapshot support is turned on by checking that the hbase.snapshot.enabled property in hbase-site.xml is set to true. To take a snapshot of a specified table, use the snapshot command. (No file copies are performed.)
hbase> snapshot 'tableName', 'snapshotName'
To list all the snapshots, use the list_snapshots command. It displays the snapshot name, the source table, and the creation date and time.
hbase> list_snapshots
SNAPSHOT          TABLE + CREATION TIME
 TestSnapshot     TestTable (Mon Feb 25 21:13:49 +0000 2013)
To remove a snapshot, use the delete_snapshot command. Removing a snapshot doesn’t impact cloned tables or other subsequent snapshots taken.
hbase> delete_snapshot 'snapshotName'
To create a new table from a specified snapshot (clone), use the clone_snapshot command. No data copies are performed, so you don’t end up using twice the space for the same data.
hbase> clone_snapshot 'snapshotName', 'newTableName'
To replace the current table schema/data with a specified snapshot content, use the restore_snapshot command.
hbase> restore_snapshot 'snapshotName'
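As a rough illustration, the restore step can be scripted by generating the shell lines to run. The sketch below is hedged: `restore_script` is a hypothetical helper, and the surrounding `disable`/`enable` commands are an assumption not stated in this article (they reflect the usual requirement that a table be offline while it is restored). It only builds text; it does not talk to a cluster.

```python
def restore_script(table, snapshot):
    """Return HBase shell lines that restore `table` from `snapshot`.

    Hypothetical helper: it only formats commands and never contacts HBase.
    """
    for name in (table, snapshot):
        if "'" in name:
            raise ValueError("names must not contain single quotes")
    return [
        f"disable '{table}'",              # assumption: take the table offline first
        f"restore_snapshot '{snapshot}'",  # command shown in the article
        f"enable '{table}'",               # assumption: bring the table back online
    ]


if __name__ == "__main__":
    print("\n".join(restore_script("TestTable", "TestSnapshot")))
```

The generated lines could be saved to a file and passed to the HBase shell; remember that any changes made to the table since the snapshot are lost on restore.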
To export an existing snapshot to another cluster, use the ExportSnapshot tool. The export doesn’t impact the Region Servers’ workload; it works at the HDFS level, and you have to specify an HDFS location (the hbase.rootdir of the other cluster).
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot SnapshotName -copy-to hdfs://srv2:8082/hbase
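If the export is driven from a script, it can help to build the command line in one place. The following is a minimal sketch, assuming only the two options used in the command above (`-snapshot` and `-copy-to`); `build_export_command` is a hypothetical helper name, and the `hdfs://` check simply mirrors the statement that the destination is an HDFS location.

```python
import shlex


def build_export_command(snapshot_name, dest_rootdir):
    """Return the argv list for the ExportSnapshot invocation shown above.

    Hypothetical helper: it builds the command but does not run it.
    """
    if not snapshot_name:
        raise ValueError("snapshot name must not be empty")
    if not dest_rootdir.startswith("hdfs://"):
        # Assumption taken from the text: the target is an HDFS location.
        raise ValueError("destination must be an hdfs:// URI")
    return [
        "hbase",
        "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot_name,
        "-copy-to", dest_rootdir,
    ]


if __name__ == "__main__":
    argv = build_export_command("SnapshotName", "hdfs://srv2:8082/hbase")
    print(shlex.join(argv))  # pass argv to subprocess.run(...) to execute it
```

Keeping the arguments as a list avoids shell-quoting problems if the command is later executed with `subprocess.run`.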
Snapshots rely on some assumptions, and currently there are a couple of tools that are not fully integrated with the new feature:
Currently the snapshot feature includes all the basic required functionality, but there’s still much work to do, including metrics, Web UI integration, disk usage optimizations and more.
To learn more about how to configure HBase and use snapshots, review the documentation.
Matteo Bertozzi is a Software Engineer on the Platform team, and an HBase committer.