Cloudera Data Platform (CDP) supports access controls on tables and columns, as well as on files and directories via Apache Ranger since its first release. It is common to have different workloads using the same data – some require authorizations at the table level (Apache Hive queries) and others at the underlying files (Apache Spark […]
Cloudera Data Platform (CDP) supports access controls on tables and columns, as well as on files and directories via Apache Ranger since its first release. It is common to have different workloads using the same data – some require authorizations at the table level (Apache Hive queries) and others at the underlying files (Apache Spark jobs). Unfortunately, in such instances you would have to create and maintain separate Ranger policies for both Hive and HDFS, that correspond to each other.
As a result, whenever a change is made on a Hive table policy, the data admin should make a consistent change in the corresponding HDFS policy. Failure to do so could result in security and/or data exposure issues. Ideally the data admin would set a single table policy, and the corresponding file access policies would automatically be kept in sync along with access audits, referring to the table policy that enforced it.
In this blog post I will introduce a new feature that provides this behavior called the Ranger Resource Mapping Service (RMS). The RMS was included in CDP Private Cloud Base 7.1.4 as tech preview and became GA in CDP Private Cloud Base 7.1.5.
In a nutshell, Ranger RMS enables automatic translation of access policies from Hive to HDFS, reducing the operational burden of policy management. In simple terms, this means that any user with access permissions on a Hive table automatically receives similar HDFS file level access permissions on the table’s data files. So, Ranger RMS allows you to authorize access to HDFS directories and files using policies defined for Hive tables. Any access authorization materialized through Ranger RMS is fully audited and would be present in Ranger Audit logs.
The functionality provided by Ranger RMS is very useful for the usage of external table data by non-Hive workloads such as Spark. Let us consider the users who are currently on the different versions of the Cloudera product stack to understand how this feature would benefit them.
With the introduction of Ranger RMS in CDP Private Cloud Base 7.1.4, Ranger provides an equivalent functionality as the Sentry HDFS ACL sync in CDH. Users upgrading or migrating from CDH to the latest version of CDP should not worry about losing this important capability. Additionally, for HDP users upgrading or migrating to the latest version of CDP, Ranger RMS removes the need for separately managed storage policies on Hive table locations. This means many manually implemented Ranger HDFS policies, Hadoop ACLs, or POSIX permissions created solely for this purpose can now be removed, if desired. This eases the operational maintenance requirement for policies and reduces the chance of mistakes that can happen during the manual steps performed by a data steward or admin.
As suggested, Ranger RMS internally translates the Hive policies into HDFS access rules and allows the HDFS NameNode to enforce them. Though it seems direct and simple, the automatic translation of access considers several factors before granting any user the HDFS file level access. Ranger RMS does not create explicit HDFS policies in Ranger, nor does it change the HDFS ACLs presented to users (as Apache Sentry does for ‘hdfs dfs -ls’ command). Instead, it generates a mapping that allows the Ranger Plugin in HDFS to make run-time decisions based on the Hadoop SQL grants.
The picture below shows the interactions between NameNode, Ranger Admin, Hive Metastore and Ranger RMS.
Each time Ranger RMS starts, it connects to the Hive Metastore (HMS) and performs a sync that generates a resource mapping file linking Hive resources to their storage locations on HDFS. This mapping file is stored locally in a storage cache within RMS. It is also persisted in RMS specific tables in the backend database used by the Ranger service. After startup, Ranger RMS periodically updates this map, querying HMS every 30 seconds for new mapping updates via Hive notifications. This polling frequency is configurable using the “ranger-rms.polling.notifications.frequence.ms” setting.
If no sync checkpoint exists or it is not recognized by HMS, then a full sync is done. For example, clearing the Ranger RMS mapping in the backend database may be required when specific configuration changes are made. Such an action clears the sync checkpoint and causes a full resync to occur the next time RMS is restarted. The synchronization process queries HMS to gather Hive object metadata, such as database name and table name to map to the underlying file paths in HDFS.
On the HDFS side, the Ranger HDFS plugin running in the NameNode now has an additional HivePolicyEnforcer module. In addition to downloading the HDFS policies from Ranger Admin, this enhanced HDFS plugin also downloads Hive policies from Ranger Admin, along with the mapping file from Ranger RMS. HDFS access is then determined by both HDFS policies and Hive policies.
During the evaluation of access requirements for HDFS files, any HDFS policies from Ranger are applied first. It is then checked to see if the HDFS resource has an entry in the resource mapping file provided by RMS. Next the corresponding Hive resource is computed from the mapping and then Hive policies present in Ranger are applied. Finally depending on the various configurations, the composite evaluation result is computed.
The following flow diagram depicts the policy evaluation process when Ranger RMS is involved. It is important to note that even if you have Ranger RMS enabled, manually created HDFS policies will have priority and can override RMS policy behavior.
There are two types of Hive tables in CDP – Managed Tables and External Tables. A detailed explanation of these different types can be found in the official documentation for Hive. While the primary use case for Ranger RMS is to simplify External table policy management, there are specific conditions where you may need to enable this on Managed tables.
Spark Direct Reader mode in Hive Warehouse Connector can read Hive transactional tables directly from the filesystem. This is one special use-case where enabling RMS for Hive Managed Tables can be beneficial. The user launching such a Spark application must have read and execute permissions on the Hive warehouse location. If your environment has many applications using Spark direct reader to consume Hive transactional table data within a Spark application, you can consider enabling Ranger RMS for Hive Managed Tables. Apart from this use-case, the recommendation is to not open up the Hive warehouse location for Managed Tables through RMS for security reasons. Please study your use-cases properly specifically for Hive Managed Tables before enabling Ranger RMS for Managed Tables. This feature may not be desirable in many scenarios where Hive Managed Tables location should be locked down completely.
The most important point for Ranger RMS on a Managed Table is the location. It should be noted that the location for such Managed Tables can only be the managed space of the database in which they are created. If a location is not defined at the database level, then it defaults to the value of “hive.metastore.warehouse.dir”.
Ranger RMS does provide an option to map managed tables as specified below.
If this configuration is not enabled in Ranger RMS, then the default Hive managed space would be locked down to the user and group for “hive”. In such a case, users belonging to any other groups would not have any access to the default hive managed location in HDFS.
Though Ranger RMS provides an option to map Hive Managed Tables, it needs to be understood that enabling this feature for Managed Tables provides users who have SQL level access on such tables with direct read access on the corresponding tables’ HDFS files. The following should be noted when enabling this feature in RMS. Users with update permission on Managed Tables in Hadoop SQL would be able to update table data through SQL. But RMS would not provide users with direct write access on the managed HDFS location even if they have update permission in Hadoop SQL.
If Ranger RMS is installed and configured in a CDP environment the following details provide high-level access requirements for performing read/write on Hive tables’ corresponding HDFS files.
Ranger RMS adds a key capability that enhances the security design of CDP clusters. By providing an automatic method to use Hive policies to access the corresponding tables’ HDFS data directly, it not only brings in an important feature to the Ranger data security framework, but also reduces the policy management overhead considerably.
To learn more about Ranger RMS and related features, here are some helpful resources: