Configuring User Impersonation and Proxying

PXF accesses Hadoop services on behalf of Greenplum Database end users. By default, PXF tries to access data source services (HDFS, Hive, HBase) using the identity of the Greenplum Database user account that logs into Greenplum Database and performs an operation using a PXF connector profile. Keep in mind that PXF uses only the login identity of the user when accessing Hadoop services. For example, if a user logs into Greenplum Database as the user jane and then executes SET ROLE or SET SESSION AUTHORIZATION to assume a different user identity, all PXF requests still use the identity jane to access Hadoop services.

With the default PXF configuration, you must explicitly configure each Hadoop data source (HDFS, Hive, HBase) to allow the PXF process owner (usually gpadmin) to act as a proxy for impersonating users or groups. See Configure Hadoop Proxying, Hive User Impersonation, and HBase User Impersonation.

As an alternative, you can disable PXF user impersonation. With user impersonation disabled, PXF executes all Hadoop service requests as the PXF process owner (usually gpadmin). This behavior matches earlier releases of PXF, but it provides no means to control access to Hadoop services on a per-user basis: every Greenplum Database user reaches Hadoop with the same identity. It also requires that the gpadmin user have access to all files and directories in HDFS, and to all tables in Hive and HBase, that are referenced in PXF external table definitions. See Configure PXF User Impersonation for information about disabling user impersonation.

Configure PXF User Impersonation

Perform the following procedure to turn PXF user impersonation on or off in your Greenplum Database cluster. User impersonation is enabled by default, so if you are configuring PXF for the first time and want to keep it enabled, you need not perform this procedure.

  1. Log in to your Greenplum Database master node as the administrative user:

    $ ssh gpadmin@<gpmaster>
    
  2. Recall the location of the PXF user configuration directory ($PXF_CONF). Open the $PXF_CONF/conf/pxf-env.sh configuration file in a text editor. For example:

    gpadmin@gpmaster$ vi $PXF_CONF/conf/pxf-env.sh
    
  3. Locate the PXF_USER_IMPERSONATION setting in the pxf-env.sh file. Set the value to true to turn PXF user impersonation on, or false to turn it off. For example:

    PXF_USER_IMPERSONATION="true"
    
  4. Use the pxf cluster sync command to copy the updated pxf-env.sh file to the Greenplum Database cluster. For example:

    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    
  5. If you have previously started PXF, restart it on each Greenplum Database segment host as described in Restarting PXF to apply the new setting.
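
Steps 2 through 4 can also be scripted. The following is a minimal sketch, not part of PXF: the helper name set_pxf_impersonation is hypothetical, and it assumes the setting occupies a single line of the form PXF_USER_IMPERSONATION=... (optionally commented out or prefixed with export). The demonstration runs against a scratch file with sample contents; the commands that would touch a real cluster are left as comments.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with PXF): rewrite the
# PXF_USER_IMPERSONATION line in a pxf-env.sh style file.
set_pxf_impersonation() {
    env_file="$1"
    value="$2"
    case "$value" in
        true|false) ;;
        *) echo "value must be true or false" >&2; return 1 ;;
    esac
    [ -f "$env_file" ] || { echo "no such file: $env_file" >&2; return 1; }
    tmp_file="$(mktemp)" || return 1
    # Drop any existing setting, commented out or not, then append the new one.
    grep -v -E '^[[:space:]]*#?[[:space:]]*(export[[:space:]]+)?PXF_USER_IMPERSONATION=' \
        "$env_file" > "$tmp_file" || true
    printf 'PXF_USER_IMPERSONATION="%s"\n' "$value" >> "$tmp_file"
    # Overwrite in place so the original file permissions are kept.
    cat "$tmp_file" > "$env_file" && rm -f "$tmp_file"
}

# Demonstration on a scratch file with sample contents.
demo_file="$(mktemp)"
printf '%s\n' '# export PXF_USER_IMPERSONATION=true' \
              'PXF_JVM_OPTS="-Xmx2g -Xms1g"' > "$demo_file"
set_pxf_impersonation "$demo_file" false
cat "$demo_file"
rm -f "$demo_file"

# On a real master node the equivalent would be (left commented out):
#   set_pxf_impersonation "$PXF_CONF/conf/pxf-env.sh" false
#   $GPHOME/pxf/bin/pxf cluster sync
```

The helper removes the old line and appends a new one instead of editing in place, which keeps it independent of the differences between sed implementations. Restarting PXF (step 5) is still required afterwards.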

Configure Hadoop Proxying

When PXF user impersonation is enabled (the default), you must configure the Hadoop core-site.xml configuration file to permit user impersonation for PXF. Follow these steps:

  1. On your Hadoop cluster, open the core-site.xml configuration file using a text editor, or use Ambari to add or edit the Hadoop property values described in this procedure.

  2. Set the property hadoop.proxyuser.<name>.hosts to specify the list of PXF host names from which proxy requests are permitted. Substitute the PXF proxy user (generally gpadmin) for <name>, and provide multiple PXF host names in a comma-separated list. For example:

    <property>
        <name>hadoop.proxyuser.gpadmin.hosts</name>
        <value>pxfhost1,pxfhost2,pxfhost3</value>
    </property>
    
  3. Set the property hadoop.proxyuser.<name>.groups to specify the list of HDFS groups that PXF can impersonate. You should limit this list to only those groups that require access to HDFS data from PXF. For example:

    <property>
        <name>hadoop.proxyuser.gpadmin.groups</name>
        <value>group1,group2</value>
    </property>
    
  4. After changing core-site.xml, you must restart Hadoop for your changes to take effect.

  5. Copy the updated core-site.xml file to the PXF Hadoop server configuration directory $PXF_CONF/servers/<server_name> on the master and synchronize the configuration to the standby master and each Greenplum Database segment host.
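
Before you restart Hadoop and copy the file, it can help to confirm that both proxy properties from steps 2 and 3 are present. The following is a minimal sketch: the helper name check_proxyuser_props is hypothetical, and it only does a plain-text match on the <name> elements, so it neither parses the XML nor validates the host and group values. The sample core-site.xml is a scratch file created for the demonstration.

```shell
#!/bin/sh
# Hypothetical helper: report whether a core-site.xml file names both
# hadoop.proxyuser.<name>.hosts and hadoop.proxyuser.<name>.groups.
check_proxyuser_props() {
    site_file="$1"
    proxy_user="$2"
    missing=0
    for suffix in hosts groups; do
        prop="hadoop.proxyuser.${proxy_user}.${suffix}"
        # -F treats the pattern as a fixed string, so the dots are literal.
        if ! grep -q -F "<name>${prop}</name>" "$site_file"; then
            echo "missing property: ${prop}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration on a scratch file that mirrors the examples above.
sample_site="$(mktemp)"
cat > "$sample_site" <<'EOF'
<configuration>
    <property>
        <name>hadoop.proxyuser.gpadmin.hosts</name>
        <value>pxfhost1,pxfhost2,pxfhost3</value>
    </property>
    <property>
        <name>hadoop.proxyuser.gpadmin.groups</name>
        <value>group1,group2</value>
    </property>
</configuration>
EOF
if check_proxyuser_props "$sample_site" gpadmin; then
    echo "proxy user properties found for gpadmin"
fi
rm -f "$sample_site"

# Step 5 on a real master node would then look like (left commented out;
# <server_name> stays a placeholder for your server configuration):
#   cp core-site.xml "$PXF_CONF/servers/<server_name>/"
#   $GPHOME/pxf/bin/pxf cluster sync
```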

Hive User Impersonation

The PXF Hive connector uses the Hive MetaStore to determine the HDFS locations of Hive tables, and then accesses the underlying HDFS files directly. No specific impersonation configuration is required for Hive, because the Hadoop proxy configuration in core-site.xml also applies to Hive tables accessed in this manner.

HBase User Impersonation

In order for user impersonation to work with HBase, you must enable the AccessController coprocessor in the HBase configuration and restart the cluster. See 61.3 Server-side Configuration for Simple User Access Operation in the Apache HBase Reference Guide for the required hbase-site.xml configuration settings.