Configuring the JDBC Connector (Optional)

A newer version of this documentation is available. Click here to view the most up-to-date release of the Greenplum 5.x documentation.

You can use PXF to access an external SQL database including MySQL, ORACLE, PostgreSQL, Hive, and Apache Ignite. This topic describes how to configure the PXF JDBC Connector to access these external data sources.

If you do not plan to use the PXF JDBC Connector, then you do not need to perform this procedure.

About JDBC Configuration

To access data in an external SQL database with the PXF JDBC Connector, you must:

  • Register a compatible JDBC driver JAR file
  • Specify the JDBC driver class name, database URL, and client credentials

In previous releases of Greenplum Database, you may have specified the JDBC driver class name, database URL, and client credentials via options in the CREATE EXTERNAL TABLE command. PXF now supports file-based server configuration for the JDBC Connector. This configuration, described below, allows you to specify these options and credentials in a file.

Note: PXF external tables that you previously created that directly specified the JDBC connection options will continue to work. If you want to move these tables to use JDBC file-based server configuration, you must create a server configuration, drop the external tables, and then recreate the tables specifying an appropriate SERVER=<server_name> clause.

JDBC Driver JAR Registration

The PXF JDBC Connector is installed with the postgresql-9.1-901-1.jdbc4.jar JAR file. If you require a different JDBC driver, ensure that you install the JDBC driver JAR file for the external SQL database in the $PXF_CONF/lib directory on each segment host. Be sure to install JDBC driver JAR files that are compatible with your JRE version. See Registering PXF JAR Dependencies for additional information.

JDBC Server Configuration

When you configure the PXF JDBC Connector, you add at least one named PXF server configuration for the connector as described in Configuring PXF Servers. You can also configure one or more statically-defined queries to run against the remote SQL database.

PXF provides a template configuration file for the JDBC Connector. This server template configuration file, located in $PXF_CONF/templates/jdbc-site.xml, identifies properties that you can configure to establish a connection to the external SQL database. The template also includes optional properties that you can set before executing query or insert commands in the external database session.

The required properties in the jdbc-site.xml server template file follow:

Property Description Value
jdbc.driver Class name of the JDBC driver. The JDBC driver Java class name; for example org.postgresql.Driver.
jdbc.url The URL that the JDBC driver uses to connect to the database. The database connection URL (database-specific); for example jdbc:postgresql://phost:pport/pdatabase.
jdbc.user The database user name. The user name for connecting to the database.
jdbc.password The password for jdbc.user. The password for connecting to the database.
When you configure a PXF JDBC server, you specify the external database user credentials to PXF in clear text in a configuration file.

Connection-Level Properties

To set additional JDBC connection-level properties, add jdbc.connection.property.<CPROP_NAME> properties to jdbc-site.xml. PXF passes these properties to the JDBC driver when it establishes the connection to the external SQL database (DriverManager.getConnection()).

Replace <CPROP_NAME> with the connection property name and specify its value:

Property Description Value
jdbc.connection.property.<CPROP_NAME> The name of a property (<CPROP_NAME>) to pass to the JDBC driver when PXF establishes the connection to the external SQL database. The value of the <CPROP_NAME> property.

Example: To set the createDatabaseIfNotExist connection property on a JDBC connection to a PostgreSQL database, include the following property block in jdbc-site.xml:

<property>
    <name>jdbc.connection.property.createDatabaseIfNotExist</name>
    <value>true</value>
 </property>

Ensure that the JDBC driver for the external SQL database supports any connection-level property that you specify.

Connection Transaction Isolation Property

The SQL standard defines four transaction isolation levels. The level that you specify for a given connection to an external SQL database determines how and when the changes made by one transaction executed on the connection are visible to another.

The PXF JDBC Connector exposes an optional server configuration property named jdbc.connection.transactionIsolation that enables you to specify the transaction isolation level. PXF sets the level (setTransactionIsolation()) just after establishing the connection to the external SQL database.

The JDBC Connector supports the following jdbc.connection.transactionIsolation property values:

SQL Level PXF Property Value
Read uncommitted READ_UNCOMMITTED
Read committed READ_COMMITTED
Repeatable Read REPEATABLE_READ
Serializable SERIALIZABLE

For example, to set the transaction isolation level to Read uncommitted, add the following property block to the jdbc-site.xml file:

<property>
    <name>jdbc.connection.transactionIsolation</name>
    <value>READ_UNCOMMITTED</value>
</property>

Different SQL databases support different transaction isolation levels. Ensure that the external database supports the level that you specify.

Statement-Level Properties

The PXF JDBC Connector executes a query or insert command on an external SQL database table in a statement. The Connector exposes properties that enable you to configure certain aspects of the statement before the command is executed in the external database. The Connector supports the following statement-level properties:

Property Description Value
jdbc.statement.batchSize The number of rows to write to the external database table in a batch. The number of rows. The default write batch size is 100.
jdbc.statement.fetchSize The number of rows to fetch/buffer when reading from the external database table. The number of rows. The default read fetch size is 1000.
jdbc.statement.queryTimeout The amount of time (in seconds) the JDBC driver waits for a statement to execute. This timeout applies to statements created for both read and write operations. The timeout duration in seconds. The default wait time is unlimited.

PXF uses the default value for any statement-level property that you do not explicitly configure.

Example: To set the read fetch size to 5000, add the following property block to jdbc-site.xml:

<property>
    <name>jdbc.statement.fetchSize</name>
    <value>5000</value>
</property>

Ensure that the JDBC driver for the external SQL database supports any statement-level property that you specify.

Session-Level Properties

To set session-level properties, add the jdbc.session.property.<SPROP_NAME> property to jdbc-site.xml. PXF will SET these properties in the external database before executing a query.

Replace <SPROP_NAME> with the session property name and specify its value:

Property Description Value
jdbc.session.property.<SPROP_NAME> The name of a session property (<SPROP_NAME>) to set before query execution. The value of the <SPROP_NAME> property.

Note: The PXF JDBC Connector passes both the session property name and property value to the external SQL database exactly as specified in the jdbc-site.xml server configuration file. To limit the potential threat of SQL injection, the Connector rejects any property name or value that contains the ;, \n, \b, or \0 characters.

The PXF JDBC Connector handles the session property SET syntax for all supported external SQL databases.

Example: To set the search_path parameter before running a query in a PostgreSQL database, add the following property block to jdbc-site.xml:

<property>
    <name>jdbc.session.property.search_path</name>
    <value>public</value>
</property>

Ensure that the JDBC driver for the external SQL database supports any property that you specify.

About JDBC User Impersonation

The PXF JDBC Connector uses the jdbc.user setting or information in the jdbc.url to determine the identity of the user to connect to the external data store. When PXF JDBC user impersonation is disabled (the default), the behavior of the JDBC Connector is further dependent upon the external data store. For example, if you are using the JDBC Connector to access Hive, the Connector uses the settings of certain Hive authentication and impersonation properties to determine the user. You may be required to provide a jdbc.user setting, or add properties to the jdbc.url setting in the server jdbc-site.xml file.

When you enable PXF JDBC user impersonation, the PXF JDBC Connector accesses the external data store on behalf of a Greenplum Database end user. The Connector uses the name of the Greenplum Database user that accesses the PXF external table to try to connect to the external data store.

The pxf.impersonation.jdbc property governs JDBC user impersonation. JDBC user impersonation is disabled by default. To enable JDBC user impersonation for a server configuration, set the property to true:

<property>
    <name>pxf.impersonation.jdbc</name>
    <value>true</value>
</property>

When you enable JDBC user impersonation for a PXF server, PXF overrides the value of a jdbc.user property setting defined in either jdbc-site.xml or <greenplum_user_name>-user.xml, or specified in the external table DDL, with the Greenplum Database user name. For user impersonation to work effectively when the external data store requires passwords to authenticate connecting users, you must specify the jdbc.password setting for each user that can be impersonated in that user’s <greenplum_user_name>-user.xml property override file. Refer to Configuring a PXF User for more information about per-server, per-Greenplum-user configuration.

JDBC Named Query Configuration

A PXF named query is a static query that you configure, and that PXF runs in the remote SQL database.

To configure and use a PXF JDBC named query:

  1. You define the query in a text file.
  2. You provide the query name to Greenplum Database users.
  3. The Greenplum Database user references the query in a Greenplum Database external table definition.

PXF runs the query each time the user invokes a SELECT command on the Greenplum Database external table.

Defining a Named Query

You create a named query by adding the query statement to a text file that has the following naming format: <query_name>.sql. You can define one or more named queries for a JDBC server configuration. Each query must reside in a separate text file.

You must place a query text file in the PXF JDBC server configuration directory from which it will be accessed. If you want to make the query available to more than one JDBC server configuration, you must copy the query text file to the configuration directory for each JDBC server.

The query text file must contain a single query that you want to run in the remote SQL database. You must contruct the query in accordance with the syntax supported by the database.

For example, if a MySQL database has a customers table and an orders table, you could include the following SQL statement in a query text file:

SELECT c.name, c.city, sum(o.amount) AS total, o.month
  FROM customers c JOIN orders o ON c.id = o.customer_id
  WHERE c.state = 'CO'
GROUP BY c.name, c.city, o.month

Do not provide the ending semicolon (;) for the SQL statement.

Query Naming

The Greenplum Database user references a named query by specifying the query file name without the extension. For example, if you define a query in a file named report.sql, the name of that query is report.

Named queries are associated with a specific JDBC server configuration. You will provide the available query names to the Greenplum Database users that you allow to create external tables using the server configuration.

Referencing a Named Query

The Greenplum Database user specifies query:<query_name> rather than the name of a remote SQL database table when they create the external table. For example, if the query is defined in the file $PXF_CONF/servers/mydb/report.sql, the CREATE EXTERNAL TABLE LOCATION clause would include the following components:

LOCATION ('pxf://query:report?PROFILE=JDBC&SERVER=mydb ...')

Refer to About Using Named Queries for information about using PXF JDBC named queries.

Overriding the JDBC Server Configuration

You can override the JDBC server configuration by directly specifying certain JDBC properties via custom options in the CREATE EXTERNAL TABLE command LOCATION clause. Refer to Overriding the JDBC Server Configuration via DDL for additional information.

Example Configuration Procedure

Ensure that you have initialized PXF before you configure a JDBC Connector server.

In this procedure, you name and add a PXF JDBC server configuration for a PostgreSQL database and synchronize the server configuration(s) to the Greenplum Database cluster.

  1. Log in to your Greenplum Database master node:

    $ ssh gpadmin@<gpmaster>
    
  2. Choose a name for the JDBC server. You will provide the name to Greenplum users that you choose to allow to reference tables in the external SQL database as the configured user.

    Note: The server name default is reserved.

  3. Create the $PXF_HOME/servers/<server_name> directory. For example, use the following command to create a JDBC server configuration named pg_user1_testdb:

    gpadmin@gpmaster$ mkdir $PXF_CONF/servers/pg_user1_testdb
    
  4. Copy the PXF JDBC server template file to the server configuration directory. For example:

    gpadmin@gpmaster$ cp $PXF_CONF/templates/jdbc-site.xml $PXF_CONF/servers/pg_user1_testdb/
    
  5. Open the template server configuration file in the editor of your choice, and provide appropriate property values for your environment. For example, if you are configuring access to a PostgreSQL database named testdb on a PostgreSQL instance running on the host named pgserverhost for the user named user1:

    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <property>
            <name>jdbc.driver</name>
            <value>org.postgresql.Driver</value>
        </property>
        <property>
            <name>jdbc.url</name>
            <value>jdbc:postgresql://pgserverhost:5432/testdb</value>
        </property>
        <property>
            <name>jdbc.user</name>
            <value>user1</value>
        </property>
        <property>
            <name>jdbc.password</name>
            <value>changeme</value>
        </property>
    </configuration>
    
  6. Save your changes and exit the editor.

  7. Use the pxf cluster sync command to copy the new server configuration to the Greenplum Database cluster. For example:

    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    

Configuring Hive Access

You can use the PXF JDBC Connector to retrieve data from Hive. You can also use a JDBC named query to submit a custom SQL query to Hive and retrieve the results using the JDBC Connector.

This topic describes how to configure the PXF JDBC Connector to access Hive. When you configure Hive access with JDBC, you must take into account the Hive user impersonation setting, as well as whether or not the Hadoop cluster is secured with Kerberos.

JDBC Server Configuration

The PXF JDBC Connector is installed with the JAR files required to access Hive via JDBC, hive-jdbc-<version>.jar and hive-service-<version>.jar, and automatically registers these JARs.

When you configure a PXF JDBC server for Hive access, you must specify the JDBC driver class name, database URL, and client credentials just as you would when configuring a client connection to an SQL database.

To access Hive via JDBC, you must specify the following properties and values in the jdbc-site.xml server configuration file:

Property Value
jdbc.driver org.apache.hive.jdbc.HiveDriver
jdbc.url jdbc:hive2://<hiveserver2_host>:<hiveserver2_port>/<database>

The value of the HiveServer2 authentication (hive.server2.authentication) and impersonation (hive.server2.enable.doAs) properties, and whether or not the Hive service is utilizing Kerberos authentication, will inform the setting of other JDBC server configuration properties. These properties are defined in the hive-site.xml configuration file in the Hadoop cluster. You will need to obtain the values of these properties.

The following table enumerates the Hive2 authentication and impersonation combinations supported by the PXF JDBC Connector. It identifies the possible Hive user identities and the JDBC server configuration required for each.

Table heading key:

  • authentication -> Hive hive.server2.authentication Setting
  • enable.doAs -> Hive hive.server2.enable.doAs Setting
  • User Identity -> Identity that HiveServer2 will use to access data
  • Configuration Required -> PXF JDBC Connector or Hive configuration required for User Identity
authentication enable.doAs User Identity Configuration Required
NOSASL n/a No authentication Must set jdbc.connection.property.auth = noSasl
NONE, or not specified TRUE User name that you provide Set jdbc.user
NONE, or not specified TRUE Greenplum user name Set pxf.impersonation.jdbc = true
NONE, or not specified FALSE Name of the user who started Hive, typically hive None
KERBEROS TRUE Identity provided in the PXF Kerberos principal, typically gpadmin None
KERBEROS TRUE User name that you provide Set hive.server2.proxy.user in jdbc.url
KERBEROS TRUE Greenplum user name Set pxf.impersonation.jdbc = true
KERBEROS FALSE Identity provided in the PXF Kerberos principal, typically gpadmin None

Note: There are additional configuration steps required when Hive utilizes Kerberos authentication.

Example Configuration Procedure

Perform the following procedure to configure a PXF JDBC server for Hive:

  1. Log in to your Greenplum Database master node:

    $ ssh gpadmin@<gpmaster>
    
  2. Choose a name for the JDBC server.

  3. Create the $PXF_HOME/servers/<server_name> directory. For example, use the following command to create a JDBC server configuration named hivejdbc1:

    gpadmin@gpmaster$ mkdir $PXF_CONF/servers/hivejdbc1
    
  4. Copy the PXF JDBC server template file to the server configuration directory. For example:

    gpadmin@gpmaster$ cp $PXF_CONF/templates/jdbc-site.xml $PXF_CONF/servers/hivejdbc1/
    
  5. Open the jdbc-site.xml file in the editor of your choice and set the jdbc.driver and jdbc.url properties. Be sure to specify your Hive host, port, and database name:

    <property>
        <name>jdbc.driver</name>
        <value>org.apache.hive.jdbc.HiveDriver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:hive2://<hiveserver2_host>:<hiveserver2_port>/<database></value>
    </property>
    
  6. Obtain the hive-site.xml file from your Hadoop cluster and examine the file.

  7. If the hive.server2.authentication property in hive-site.xml is set to NOSASL, HiveServer2 performs no authentication. Add the following connection-level property to jdbc-site.xml:

    <property>
        <name>jdbc.connection.property.auth</name>
        <value>noSasl</value>
    </property>
    

    Alternatively, you may choose to add ;auth=noSasl to the jdbc.url.

  8. If the hive.server2.authentication property in hive-site.xml is set to NONE, or the property is not specified, you must set the jdbc.user property. The value to which you set the jdbc.user property is dependent upon the hive.server2.enable.doAs impersonation setting in hive-site.xml:

    1. If hive.server2.enable.doAs is set to TRUE (the default), Hive runs Hadoop operations on behalf of the user connecting to Hive. Choose/perform one of the following options:

      Set jdbc.user to specify the user that has read permission on all Hive data accessed by Greenplum Database. For example, to connect to Hive and run all requests as user gpadmin:

      <property>
          <name>jdbc.user</name>
          <value>gpadmin</value>
      </property>
      

      Or, turn on JDBC-level user impersonation so that PXF automatically uses the Greenplum Database user name to connect to Hive:

      <property>
          <name>pxf.impersonation.jdbc</name>
          <value>true</value>
      </property>
      

      If you enable JDBC impersonation in this manner, you must not specify a jdbc.user nor include the setting in the jdbc.url.

    2. If required, create a PXF user configuration file to manage the password setting.

    3. If hive.server2.enable.doAs is set to FALSE, Hive runs Hadoop operations as the user who started the HiveServer2 process, usually the user hive. PXF ignores the jdbc.user setting in this circumstance.

  9. If the hive.server2.authentication property in hive-site.xml is set to KERBEROS:

    1. Ensure that you have enabled Kerberos authentication for PXF as described in Configuring PXF for Secure HDFS.
    2. Ensure that you have configured the Hadoop cluster as the default PXF server.
    3. Ensure that the $PXF_CONF/servers/default/core-site.xml file includes the following setting:

      <property>
          <name>hadoop.security.authentication</name>
          <value>kerberos</value>
      </property>
      
    4. Add the saslQop property to jdbc.url, and set it to match the hive.server2.thrift.sasl.qop property setting in hive-site.xml. For example, if the hive-site.xml file includes the following property settting:

      <property>
          <name>hive.server2.thrift.sasl.qop</name>
          <value>auth-conf</value>
      </property>
      

      You would add ;saslQop=auth-conf to the jdbc.url.

    5. Add the HiverServer2 principal name to the jdbc.url. For example:

      jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf
      
    6. If hive.server2.enable.doAs is set to TRUE (the default), Hive runs Hadoop operations on behalf of the user connecting to Hive. Choose/perform one of the following options:

      Do not specify any additional properties. In this case, PXF initiates all Hadoop access with the identity provided in the PXF Kerberos principal (usually gpadmin).

      Or, set the hive.server2.proxy.user property in the jdbc.url to specify the user that has read permission on all Hive data. For example, to connect to Hive and run all requests as the user named integration use the following jdbc.url:

      jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf;hive.server2.proxy.user=integration
      

      Or, enable PXF JDBC impersonation in the jdbc-site.xml file so that PXF automatically uses the Greenplum Database user name to connect to Hive. For example:

      <property>
          <name>pxf.impersonation.jdbc</name>
          <value>true</value>
      </property>
      

      If you enable JDBC impersonation, you must not explicitly specify a hive.server2.proxy.user in the jdbc.url.

    7. If required, create a PXF user configuration file to manage the password setting.

    8. If hive.server2.enable.doAs is set to FALSE, Hive runs Hadoop operations with the identity provided by the PXF Kerberos principal (usually gpadmin).

  10. Save your changes and exit the editor.

  11. Use the pxf cluster sync command to copy the new server configuration to the Greenplum Database cluster. For example:

    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync