Configuring the Hadoop User, User Impersonation, and Proxying
PXF accesses Hadoop services on behalf of Greenplum Database end users.
When user impersonation is enabled (the default), PXF accesses Hadoop services using the identity of the Greenplum Database user account that logs in to Greenplum and performs an operation that uses a PXF connector. Keep in mind that PXF uses only the login identity of the user when accessing Hadoop services. For example, if a user logs in to Greenplum Database as the user jane and then executes SET ROLE or SET SESSION AUTHORIZATION to assume a different user identity, all PXF requests still use the identity jane to access Hadoop services. When user impersonation is enabled, you must explicitly configure each Hadoop data source (HDFS, Hive, HBase) to allow PXF to act as a proxy for impersonating specific Hadoop users or groups.
When user impersonation is disabled, PXF executes all Hadoop service requests as the PXF process owner (usually gpadmin) or the Hadoop user identity that you specify. This behavior provides no means to control access to Hadoop services for different Greenplum Database users. It requires that this user have access to all files and directories in HDFS, and all tables in Hive and HBase that are referenced in PXF external table definitions.
You configure the Hadoop user and PXF user impersonation setting for a server via the pxf-site.xml server configuration file. Refer to About Kerberos and User Impersonation Configuration (pxf-site.xml) for more information about the configuration properties in this file.
The following table describes some of the PXF configuration scenarios for Hadoop access:
Scenario | pxf-site.xml Required | Impersonation Setting | Required Configuration |
---|---|---|---|
PXF accesses Hadoop using the identity of the Greenplum Database user. | yes | true | Enable user impersonation, identify the Hadoop proxy user in the pxf.service.user.name property setting, and configure Hadoop proxying for this Hadoop user identity. |
PXF accesses Hadoop using the identity of the operating system user that started the PXF process. | yes | false | Disable user impersonation. |
PXF accesses Hadoop using a user identity that you specify. | yes | false | Disable user impersonation and identify the Hadoop user identity in the pxf.service.user.name property setting. |
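The scenarios in the table can be summarized as a small decision rule. The following is a minimal sketch, not PXF code: `resolve_hadoop_user` and its arguments are hypothetical names introduced for illustration, and the Kerberos case (where the proxy user comes from the principal) is deliberately left out.

```shell
# Hypothetical helper (not part of PXF): prints "<hadoop identity> <proxy user>".
#   $1 Greenplum login user (SET ROLE / SET SESSION AUTHORIZATION are ignored)
#   $2 impersonation setting: true | false
#   $3 value of pxf.service.user.name, or "" when it is not set
#   $4 operating system user that started PXF (for example gpadmin)
resolve_hadoop_user() {
  login_user="$1"
  impersonation="$2"
  service_user="$3"
  process_owner="$4"
  # The configured service user takes precedence over the process owner.
  pxf_user="${service_user:-$process_owner}"
  if [ "$impersonation" = "true" ]; then
    # PXF acts on behalf of the login user; pxf_user is the proxy
    # that Hadoop must be configured to trust.
    printf '%s %s\n' "$login_user" "$pxf_user"
  else
    # No proxying: every request runs as pxf_user.
    printf '%s -\n' "$pxf_user"
  fi
}

resolve_hadoop_user jane true  hdfsuser1 gpadmin   # prints: jane hdfsuser1
resolve_hadoop_user jane false ""        gpadmin   # prints: gpadmin -
resolve_hadoop_user jane false hdfsuser1 gpadmin   # prints: hdfsuser1 -
```

The three calls correspond to the three table rows, in order.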
Configure the Hadoop User
By default, PXF accesses Hadoop using the identity of the Greenplum Database user, and you are required to set up a proxy Hadoop user. You can configure PXF to access Hadoop as a different user on a per-server basis.
Perform the following procedure to configure the Hadoop user:
Log in to your Greenplum Database master node as the administrative user:
$ ssh gpadmin@<gpmaster>
Identify the name of the PXF Hadoop server configuration that you want to update.
Navigate to the server configuration directory. For example, if the server is named hdp3:
gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
If the server configuration does not yet include a pxf-site.xml file, copy the template file to the directory. For example:
gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
Open the pxf-site.xml file in the editor of your choice, and configure the Hadoop user name. When impersonation is disabled, this name identifies the Hadoop user identity that PXF will use to access the Hadoop system. When user impersonation is enabled, this name identifies the PXF proxy Hadoop user. For example, if you want to access Hadoop as the user hdfsuser1:
<property>
    <name>pxf.service.user.name</name>
    <value>hdfsuser1</value>
</property>
Save the pxf-site.xml file and exit the editor.
Use the pxf cluster sync command to synchronize the PXF Hadoop server configuration to your Greenplum Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
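The "copy the template only if pxf-site.xml is missing" step of this procedure can be scripted. The sketch below is illustrative only: `ensure_pxf_site` is a hypothetical helper, and the demonstration runs against a scratch directory standing in for $PXF_CONF so that it does not touch a real installation.

```shell
# Hypothetical helper: make sure a server directory has a pxf-site.xml,
# copying the template only when the file is missing.
#   $1 PXF configuration root (the value of $PXF_CONF)
#   $2 server name (for example hdp3)
ensure_pxf_site() {
  conf_root="$1"
  server_dir="$conf_root/servers/$2"
  if [ ! -f "$server_dir/pxf-site.xml" ]; then
    cp "$conf_root/templates/pxf-site.xml" "$server_dir/" || return 1
  fi
  # Report the file that should now be edited.
  printf '%s\n' "$server_dir/pxf-site.xml"
}

# Dry run: a scratch directory plays the role of $PXF_CONF.
scratch=$(mktemp -d)
mkdir -p "$scratch/templates" "$scratch/servers/hdp3"
printf '<configuration>\n</configuration>\n' > "$scratch/templates/pxf-site.xml"
site_file=$(ensure_pxf_site "$scratch" hdp3)
echo "edit $site_file, then run: \$GPHOME/pxf/bin/pxf cluster sync"
```

On a real master host you would pass "$PXF_CONF" instead of the scratch directory, edit the file, and then run the sync command shown in the procedure.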
Configure PXF User Impersonation
PXF user impersonation is enabled by default for Hadoop servers. You can configure PXF user impersonation on a per-server basis. Perform the following procedure to turn PXF user impersonation on or off for the Hadoop server configuration:
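Note: In earlier releases, you configured user impersonation globally for Hadoop clusters via the now deprecated PXF_USER_IMPERSONATION setting in the pxf-env.sh configuration file.
Navigate to the server configuration directory. For example, if the server is named hdp3:
gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3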
If the server configuration does not yet include a pxf-site.xml file, copy the template file to the directory. For example:
gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
Open the pxf-site.xml file in the editor of your choice, and update the user impersonation property setting. For example, if you do not require user impersonation for this server configuration, set the pxf.service.user.impersonation property to false:
<property>
    <name>pxf.service.user.impersonation</name>
    <value>false</value>
</property>
If you require user impersonation, turn it on:
<property>
    <name>pxf.service.user.impersonation</name>
    <value>true</value>
</property>
If you enabled user impersonation, you must configure Hadoop proxying as described in Configure Hadoop Proxying. You must also configure Hive User Impersonation and HBase User Impersonation if you plan to use those services.
Save the pxf-site.xml file and exit the editor.
Use the pxf cluster sync command to synchronize the PXF Hadoop server configuration to your Greenplum Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
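Before you synchronize, it can help to confirm which value a server configuration actually carries. The following is a rough sketch using only standard text tools: `get_pxf_property` is a hypothetical helper, and it assumes a simple file with no XML comments or nested markup around the property. A real XML parser is the safer choice for anything beyond a quick check.

```shell
# Hypothetical helper: print the value of one property from a pxf-site.xml.
#   $1 path to the file
#   $2 property name (for example pxf.service.user.impersonation)
get_pxf_property() {
  # Escape dots so they match literally in the sed pattern.
  name_re=$(printf '%s' "$2" | sed 's/\./\\./g')
  # Join the file into one line, then capture the <value> that follows <name>.
  tr '\n' ' ' < "$1" |
    sed -n "s|.*<name>[[:space:]]*${name_re}[[:space:]]*</name>[[:space:]]*<value>[[:space:]]*\([^<[:space:]]*\)[[:space:]]*</value>.*|\1|p"
}

# Demonstration against a scratch file rather than a real server configuration.
demo_site=$(mktemp)
cat > "$demo_site" <<'EOF'
<configuration>
  <property>
    <name>pxf.service.user.impersonation</name>
    <value>false</value>
  </property>
</configuration>
EOF
get_pxf_property "$demo_site" pxf.service.user.impersonation   # prints: false
```

The helper prints nothing when the property is absent, so an empty result means the file does not set that property.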
Configure Hadoop Proxying
When PXF user impersonation is enabled for a Hadoop server configuration, you must configure Hadoop to permit PXF to proxy Greenplum users. This configuration involves setting certain hadoop.proxyuser.* properties. Follow these steps to set up PXF Hadoop proxy users:
Log in to your Hadoop cluster and open the core-site.xml configuration file using a text editor, or use Ambari or another Hadoop cluster manager to add or edit the Hadoop property values described in this procedure.
Set the property hadoop.proxyuser.<name>.hosts to specify the list of PXF host names from which proxy requests are permitted. Substitute the PXF proxy Hadoop user for <name>. The PXF proxy Hadoop user is the pxf.service.user.name value that you configured in the procedure above or, if you are using Kerberos authentication to Hadoop, the primary component of the Kerberos principal. If you have not configured pxf.service.user.name, the proxy user is the operating system user that started PXF. Provide multiple PXF host names in a comma-separated list. For example, if the PXF proxy user is named hdfsuser2:
<property>
    <name>hadoop.proxyuser.hdfsuser2.hosts</name>
    <value>pxfhost1,pxfhost2,pxfhost3</value>
</property>
Set the property hadoop.proxyuser.<name>.groups to specify the list of HDFS groups that PXF, acting as Hadoop user <name>, can impersonate. You should limit this list to only those groups that require access to HDFS data from PXF. For example:
<property>
    <name>hadoop.proxyuser.hdfsuser2.groups</name>
    <value>group1,group2</value>
</property>
You must restart Hadoop for your core-site.xml changes to take effect.
Copy the updated core-site.xml file to the PXF Hadoop server configuration directory $PXF_CONF/servers/<server_name> on the Greenplum Database master and synchronize the configuration to the standby master and each Greenplum Database segment host.
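Because the two proxy properties differ only in the user name, host list, and group list, they are easy to generate. The sketch below uses a hypothetical helper, `print_proxyuser_properties`, that prints both fragments for pasting into core-site.xml. It does not edit any file or contact Hadoop.

```shell
# Hypothetical helper: print the two core-site.xml proxy properties.
#   $1 PXF proxy Hadoop user
#   $2 comma-separated PXF host names allowed to send proxy requests
#   $3 comma-separated groups whose members the proxy user may impersonate
print_proxyuser_properties() {
  for pair in "hosts=$2" "groups=$3"; do
    # ${pair%%=*} is the property suffix; ${pair#*=} is its value.
    printf '<property>\n    <name>hadoop.proxyuser.%s.%s</name>\n    <value>%s</value>\n</property>\n' \
      "$1" "${pair%%=*}" "${pair#*=}"
  done
}

print_proxyuser_properties hdfsuser2 pxfhost1,pxfhost2,pxfhost3 group1,group2
```

Keeping the group list explicit, rather than using a wildcard, follows the advice above to limit impersonation to the groups that need HDFS access from PXF.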
Hive User Impersonation
The PXF Hive connector uses the Hive MetaStore to determine the HDFS locations of Hive tables, and then accesses the underlying HDFS files directly. No specific impersonation configuration is required for Hive, because the Hadoop proxy configuration in core-site.xml also applies to Hive tables accessed in this manner.
HBase User Impersonation
For user impersonation to work with HBase, you must enable the AccessController coprocessor in the HBase configuration and restart the cluster. See 61.3 Server-side Configuration for Simple User Access Operation in the Apache HBase Reference Guide for the required hbase-site.xml configuration settings.