Configuring User Impersonation and Proxying
PXF accesses Hadoop services on behalf of Greenplum Database end users. By default, PXF tries to access data source services (HDFS, Hive, HBase) using the identity of the Greenplum Database user account that logs into Greenplum Database and performs an operation using a PXF connector profile. Keep in mind that PXF uses only the login identity of the user when accessing Hadoop services. For example, if a user logs into Greenplum Database as the user jane and then executes SET ROLE or SET SESSION AUTHORIZATION to assume a different user identity, all PXF requests still use the identity jane to access Hadoop services.
With the default PXF configuration, you must explicitly configure each Hadoop data source (HDFS, Hive, HBase) to allow the PXF process owner (usually gpadmin) to act as a proxy for impersonating users or groups. See Configuring Hadoop Proxying, Hive User Impersonation, and HBase User Impersonation.
As an alternative, you can disable PXF user impersonation. With user impersonation disabled, PXF executes all Hadoop service requests as the PXF process owner (usually gpadmin). This behavior matches earlier releases of PXF, but it provides no means to control access to Hadoop services on a per-user basis for different Greenplum Database users. It also requires that the gpadmin user have access to all files and directories in HDFS, and to all tables in Hive and HBase, that are referenced in PXF external table definitions. See Configuring PXF User Impersonation for information about disabling user impersonation.
Configuring PXF User Impersonation
Perform the following procedure to turn PXF user impersonation on or off in your Greenplum Database cluster. If you are configuring PXF for the first time, user impersonation is enabled by default, so you need not perform this procedure.
Log in to your Greenplum Database master node as the administrative user:
$ ssh gpadmin@<gpmaster>
Recall the location of the PXF user configuration directory ($PXF_CONF). Open the $PXF_CONF/conf/pxf-env.sh configuration file in a text editor. For example:
gpadmin@gpmaster$ vi $PXF_CONF/conf/pxf-env.sh
Locate the PXF_USER_IMPERSONATION setting in the pxf-env.sh file. Set the value to true to turn PXF user impersonation on, or false to turn it off.
Copy the updated pxf-env.sh file to each Greenplum Database segment host. For example, if seghostfile contains a list, one-host-per-line, of the segment hosts in your Greenplum Database cluster and $PXF_CONF is /etc/pxf/usercfg:
gpadmin@gpmaster$ gpscp -v -f seghostfile $PXF_CONF/conf/pxf-env.sh =:/etc/pxf/usercfg/conf/pxf-env.sh
If you have previously started PXF, restart it on each Greenplum Database segment host as described in Restarting PXF to apply the new setting.
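The edit to pxf-env.sh in the procedure above can also be scripted. The following is a minimal sketch, not part of the PXF tooling. It assumes the setting appears in the file as a PXF_USER_IMPERSONATION=<value> line, optionally commented out or prefixed with export, and it works on a temporary stand-in file (ENV_FILE and NEW_VALUE are hypothetical names) so that it can be run safely anywhere. On a real master host, you would point ENV_FILE at $PXF_CONF/conf/pxf-env.sh instead.

```shell
# Stand-in for $PXF_CONF/conf/pxf-env.sh (assumption: the file carries
# the setting as a commented-out line like the one written below).
ENV_FILE="$(mktemp)"
printf '%s\n' '# export PXF_USER_IMPERSONATION=true' > "$ENV_FILE"

# Desired state: "true" turns impersonation on, "false" turns it off.
NEW_VALUE="false"

# Uncomment the line if needed, keep any "export " prefix, and set the value.
# A backup of the original file is left next to it with a .bak suffix.
sed -i.bak \
    "s|^#\{0,1\} *\(export  *\)\{0,1\}PXF_USER_IMPERSONATION=.*|\1PXF_USER_IMPERSONATION=\"${NEW_VALUE}\"|" \
    "$ENV_FILE"

# Show the resulting setting.
grep 'PXF_USER_IMPERSONATION' "$ENV_FILE"
```

If the file contains no PXF_USER_IMPERSONATION line at all, the sed command changes nothing, and you would add the setting by hand. The copy and restart steps of the procedure still apply after a scripted edit.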
Configuring Hadoop Proxying
When PXF user impersonation is enabled (the default), you must configure the Hadoop core-site.xml configuration file to permit user impersonation for PXF. Follow these steps:
On your Hadoop cluster, open the core-site.xml configuration file using a text editor, or use Ambari to add or edit the Hadoop property values described in this procedure.
Set the property hadoop.proxyuser.<name>.hosts to specify the list of PXF host names from which proxy requests are permitted. Substitute the PXF proxy user (generally gpadmin) for <name>, and provide multiple PXF host names in a comma-separated list. For example:
<property>
    <name>hadoop.proxyuser.gpadmin.hosts</name>
    <value>pxfhost1,pxfhost2,pxfhost3</value>
</property>
Set the property hadoop.proxyuser.<name>.groups to specify the list of HDFS groups that PXF can impersonate. You should limit this list to only those groups that require access to HDFS data from PXF. For example:
<property>
    <name>hadoop.proxyuser.gpadmin.groups</name>
    <value>group1,group2</value>
</property>
After you change core-site.xml, you must restart Hadoop for your changes to take effect.
Copy the updated core-site.xml file to the PXF Hadoop configuration directory $PXF_CONF/servers/default on the master and on each Greenplum Database segment host.
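Before you restart Hadoop or PXF, it can help to confirm that the copy of core-site.xml you are distributing actually contains both proxy properties. The following is a minimal sketch of such a check, not a PXF or Hadoop utility. It uses a temporary stand-in file and hypothetical variable names (CORE_SITE, PROXY_USER, missing); on a real host, you would set CORE_SITE to $PXF_CONF/servers/default/core-site.xml.

```shell
# Stand-in core-site.xml so the sketch is self-contained.
CORE_SITE="$(mktemp)"
cat > "$CORE_SITE" <<'EOF'
<configuration>
  <property>
    <name>hadoop.proxyuser.gpadmin.hosts</name>
    <value>pxfhost1,pxfhost2,pxfhost3</value>
  </property>
  <property>
    <name>hadoop.proxyuser.gpadmin.groups</name>
    <value>group1,group2</value>
  </property>
</configuration>
EOF

# The PXF process owner that Hadoop must accept as a proxy user.
PROXY_USER="gpadmin"

# Look for both the .hosts and the .groups property by exact name.
missing=0
for suffix in hosts groups; do
    prop="hadoop.proxyuser.${PROXY_USER}.${suffix}"
    if grep -qF "<name>${prop}</name>" "$CORE_SITE"; then
        echo "found:   ${prop}"
    else
        echo "missing: ${prop}"
        missing=1
    fi
done
```

This check only confirms that the property names are present in the file. It does not validate the host or group values, and it assumes each <name> element sits on a single line. On a Hadoop node, hdfs getconf -confKey hadoop.proxyuser.gpadmin.hosts is another way to inspect the value that Hadoop reads from its configuration.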
Hive User Impersonation
The PXF Hive connector uses the Hive MetaStore to determine the HDFS locations of Hive tables, and then accesses the underlying HDFS files directly. No specific impersonation configuration is required for Hive, because the Hadoop proxy configuration in core-site.xml also applies to Hive tables accessed in this manner.
HBase User Impersonation
For user impersonation to work with HBase, you must enable the AccessController coprocessor in the HBase configuration and restart the cluster. See 61.3 Server-side Configuration for Simple User Access Operation in the Apache HBase Reference Guide for the required hbase-site.xml configuration settings.
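As a rough illustration of what to look for in hbase-site.xml, the sketch below writes a stand-in file and counts the entries that name the AccessController class. The property names shown (hbase.security.authorization, hbase.coprocessor.master.classes, hbase.coprocessor.region.classes) are given here as an assumption about the relevant settings; the Apache HBase Reference Guide for your HBase version remains the authority on the complete list. HBASE_SITE and count are hypothetical names used only in this example.

```shell
# Stand-in hbase-site.xml (assumed property set; verify against the
# Apache HBase Reference Guide for your HBase version).
HBASE_SITE="$(mktemp)"
cat > "$HBASE_SITE" <<'EOF'
<configuration>
  <property>
    <name>hbase.security.authorization</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.coprocessor.master.classes</name>
    <value>org.apache.hadoop.hbase.security.access.AccessController</value>
  </property>
  <property>
    <name>hbase.coprocessor.region.classes</name>
    <value>org.apache.hadoop.hbase.security.access.AccessController</value>
  </property>
</configuration>
EOF

# Count the lines that reference the AccessController coprocessor class.
count=$(grep -c 'security.access.AccessController' "$HBASE_SITE")
echo "AccessController entries: $count"
```

A count of zero on a real hbase-site.xml would suggest that the coprocessor has not been enabled. A nonzero count does not by itself prove the configuration is complete, and the cluster restart is still required.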