Configuring Your Systems and Installing Greenplum

Describes how to prepare your operating system environment for Greenplum, and install the Greenplum Database software binaries on all of the hosts that will comprise your Greenplum Database system.

Perform the following tasks in order:

  1. Make sure your systems meet the System Requirements
  2. Setting the Greenplum Recommended OS Parameters
  3. Creating the Greenplum Database Administrative User Account
  4. Installing the Greenplum Database Software
  5. Installing and Configuring Greenplum on all Hosts
  6. (Optional) Installing Oracle Compatibility Functions
  7. (Optional) Installing Optional Modules
  8. (Optional) Installing Greenplum Database Extensions
  9. (Optional) Installing and Configuring the Greenplum Platform Extension Framework (PXF)
  10. Creating the Data Storage Areas
  11. Synchronizing System Clocks
  12. Next Steps

Unless noted, these tasks should be performed for all hosts in your Greenplum Database array (master, standby master and segments).

For information about running Greenplum Database in the cloud see Cloud Services in the Pivotal Greenplum Partner Marketplace.

Important: When data loss is not acceptable for a Pivotal Greenplum Database cluster, master and segment mirroring must be enabled in order for the cluster to be supported by Pivotal. Without mirroring, system and data availability is not guaranteed; in that case, Pivotal will make best efforts to restore the cluster. For information about master and segment mirroring, see About Redundancy and Failover in the Greenplum Database Administrator Guide.
Note: For information about upgrading Pivotal Greenplum Database from a previous version, see the Greenplum Database Release Notes for the release that you are installing.

System Requirements

The following table lists minimum recommended specifications for servers intended to support Greenplum Database on Linux systems in a production environment. All servers in your Greenplum Database system must have the same hardware and software configuration. Greenplum also provides hardware build guides for its certified hardware platforms. It is recommended that you work with a Greenplum Systems Engineer to review your anticipated environment to ensure an appropriate hardware configuration for Greenplum Database.

Table 1. System Prerequisites for Greenplum Database 5.0

Operating System
  Note: See the Greenplum Database Release Notes for current supported platform information.

File Systems
  • xfs required for data storage on SUSE Linux and Red Hat (ext3 supported for root file system)

Minimum CPU
  • Pentium Pro compatible (P3/Athlon and above)

Minimum Memory
  • 16 GB RAM per server

Disk Requirements
  • 150MB per host for Greenplum installation
  • Approximately 300MB per segment instance for metadata
  • Appropriate free space for data with disks at no more than 70% capacity
  • High-speed, local storage

Network Requirements
  • 10 Gigabit Ethernet within the array
  • Dedicated, non-blocking switch
  • NIC bonding is recommended when multiple interfaces are present

Software and Utilities
  • zlib compression libraries
  • bash shell
  • GNU tar
  • GNU zip
  • GNU sed (used by the Greenplum Database gpinitsystem utility)
  • perl
  • secure shell

Important: SSL is supported only on the Greenplum Database master host system. It is not supported on the segment host systems.
Important: For all Greenplum Database host systems, SELinux must be disabled. You should also disable firewall software such as iptables (on systems such as RHEL 6.x and CentOS 6.x) or firewalld (on systems such as RHEL 7.x and CentOS 7.x). You can enable firewall software if it is required for security purposes. For information about enabling and configuring iptables, see Enabling iptables. For information about enabling and configuring firewalld, see your operating system documentation. The following examples show how to check the status of SELinux, iptables, and firewalld, and how to disable them:
  • This command checks the status of SELinux when run as root:

    # sestatus
    SELinux status: disabled

    You can disable SELinux by editing the /etc/selinux/config file. As root, change the value of the SELINUX parameter in the config file and reboot the system:

    SELINUX=disabled

    For information about disabling firewall software, see the documentation for the firewall or your operating system. For information about disabling SELinux, see the SELinux documentation.

  • This command checks the status of iptables when run as root:
    # /sbin/chkconfig --list iptables

    This is the output if iptables is disabled.

    iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off

    One method of disabling iptables is to become root, run this command, and then reboot the system:

    /sbin/chkconfig iptables off
  • This command checks the status of firewalld when run as root:

    # systemctl status firewalld

    This is the output if firewalld is disabled.

    * firewalld.service - firewalld - dynamic firewall daemon
       Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
       Active: inactive (dead)

    These commands disable firewalld when run as root:

    # systemctl stop firewalld.service
    # systemctl disable firewalld.service

Setting the Greenplum Recommended OS Parameters

Greenplum requires that certain Linux operating system (OS) parameters be set on all hosts in your Greenplum Database system (masters and segments).

In general, the following categories of system parameters need to be altered:

  • Shared Memory - A Greenplum Database instance will not work unless the shared memory segment for your kernel is properly sized. Most default OS installations have the shared memory values set too low for Greenplum Database. On Linux systems, you must also disable the OOM (out of memory) killer. For information about Greenplum Database shared memory requirements, see the Greenplum Database server configuration parameter shared_buffers in the Greenplum Database Reference Guide.
  • Network - On high-volume Greenplum Database systems, certain network-related tuning parameters must be set to optimize network connections made by the Greenplum interconnect.
  • User Limits - User limits control the resources available to processes started by a user's shell. Greenplum Database requires a higher limit on the allowed number of file descriptors that a single process can have open. The default settings may cause some Greenplum Database queries to fail because they will run out of file descriptors needed to process the query.

Linux System Settings

  • Edit the /etc/hosts file and make sure that it includes the host names and all interface address names for every machine participating in your Greenplum Database system.
  • Set the following parameters in the /etc/sysctl.conf file and reboot:
    kernel.shmmax = 500000000
    kernel.shmmni = 4096
    kernel.shmall = 4000000000
    kernel.sem = 500 1024000 200 4096
    kernel.sysrq = 1
    kernel.core_uses_pid = 1
    kernel.msgmnb = 65536
    kernel.msgmax = 65536
    kernel.msgmni = 2048
    net.ipv4.tcp_syncookies = 1
    net.ipv4.conf.default.accept_source_route = 0
    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_max_syn_backlog = 4096
    net.ipv4.conf.all.arp_filter = 1
    net.ipv4.ip_local_port_range = 10000 65535
    net.core.netdev_max_backlog = 10000
    net.core.rmem_max = 2097152
    net.core.wmem_max = 2097152
    vm.overcommit_memory = 2
    Note: Azure deployments require Greenplum Database to not use port 65330. Add the following line to sysctl.conf:
    net.ipv4.ip_local_reserved_ports=65330 
    Note: When vm.overcommit_memory is 2, you specify a value for vm.overcommit_ratio. For information about calculating the value for vm.overcommit_ratio when using resource queue-based resource management, see the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide. If you are using resource group-based resource management, the operating system vm.overcommit_ratio default value is a good starting point.

    To avoid port conflicts between Greenplum Database and other applications when initializing Greenplum Database, do not specify Greenplum Database ports in the range specified by the operating system parameter net.ipv4.ip_local_port_range. For example, if net.ipv4.ip_local_port_range = 10000 65535, you could set the Greenplum Database base port numbers to these values.

    PORT_BASE = 6000
    MIRROR_PORT_BASE = 7000
    REPLICATION_PORT_BASE = 8000
    MIRROR_REPLICATION_PORT_BASE = 9000

    For information about the port ranges that are used by Greenplum Database, see gpinitsystem.
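
    The settings in /etc/sysctl.conf normally take effect at the next reboot. If you want to apply them to the running kernel immediately (a reasonable interim step, not a substitute for the reboot recommended above), you can reload the file as root:

    # sysctl -p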

  • Set the following parameters in the /etc/security/limits.conf file:
    * soft nofile 65536
    * hard nofile 65536
    * soft nproc 131072
    * hard nproc 131072

    For Red Hat Enterprise Linux (RHEL) 6.x and CentOS 6.x, parameter values in the /etc/security/limits.d/90-nproc.conf file override the values in the limits.conf file. If a parameter value is set in both conf files, ensure that the parameter is set properly in the 90-nproc.conf file. The Linux module pam_limits sets user limits by reading the values from the limits.conf file and then from the 90-nproc.conf file. For information about PAM and user limits, see the documentation on PAM and pam_limits.
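
    To verify that the new limits are in effect, log in again as a non-root user (for example, the Greenplum administrative user once it exists) and check the values reported by the shell; they should match the limits.conf settings above:

    $ ulimit -n
    65536
    $ ulimit -u
    131072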

  • XFS is the preferred file system on Linux platforms for data storage. The following XFS mount options are recommended:
    rw,nodev,noatime,nobarrier,inode64

    See the manual page (man) for the mount command for more information about using that command (man mount opens the man page).

    The XFS options can also be set in the /etc/fstab file. This example entry from an fstab file specifies the XFS options.

    /dev/data /data xfs nodev,noatime,nobarrier,inode64 0 0
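
    As an illustration only (device and mount point names will differ on your systems), a manual mount command using the recommended options for the fstab entry above would look like this:

    # mount -t xfs -o rw,nodev,noatime,nobarrier,inode64 /dev/data /data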
  • Each disk device file should have a read-ahead (blockdev) value of 16384.

    To verify the read-ahead value of a disk device:

    # /sbin/blockdev --getra devname

    For example:

    # /sbin/blockdev --getra /dev/sdb

    To set blockdev (read-ahead) on a device:

    # /sbin/blockdev --setra bytes devname

    For example:

    # /sbin/blockdev --setra 16384 /dev/sdb

    See the manual page (man) for the blockdev command for more information about using that command (man blockdev opens the man page).
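
    The blockdev --setra setting does not persist across reboots. One common approach, shown here only as a sketch (assuming your system runs /etc/rc.d/rc.local at boot and that /dev/sdb is a data disk), is to add the command to that file so it is reapplied at startup:

    /sbin/blockdev --setra 16384 /dev/sdb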

  • The Linux disk I/O scheduler for disk access supports different policies, such as CFQ, AS, and deadline.

    The deadline scheduler option is recommended. To specify a scheduler until the next system reboot, run the following:

    # echo schedulername > /sys/block/devname/queue/scheduler

    For example:

    # echo deadline > /sys/block/sdb/queue/scheduler

    You can specify the I/O scheduler at boot time with the elevator kernel parameter. Add the parameter elevator=deadline to the kernel command in the file /boot/grub/grub.conf, the GRUB boot loader configuration file. This is an example kernel command from a grub.conf file on RHEL 6.x or CentOS 6.x. The command is on multiple lines for readability.

    kernel /vmlinuz-2.6.18-274.3.1.el5 ro root=LABEL=/
        elevator=deadline crashkernel=128M@16M  quiet console=tty1
        console=ttyS1,115200 panic=30 transparent_hugepage=never 
        initrd /initrd-2.6.18-274.3.1.el5.img
    To specify the I/O scheduler at boot time on systems that use grub2 such as RHEL 7.x or CentOS 7.x, use the system utility grubby. This command adds the parameter when run as root.
    # grubby --update-kernel=ALL --args="elevator=deadline"

    After adding the parameter, reboot the system.

    This grubby command displays kernel parameter settings.
    # grubby --info=ALL

    For more information about the grubby utility, see your operating system documentation. If the grubby command does not update the kernels, see the Note at the end of the section.
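
    To confirm which scheduler is currently active for a device, read the same sysfs file; the active policy is shown in brackets. For example, on a host where the deadline scheduler is in use for /dev/sdb, the output is similar to the following (the exact list of available schedulers varies by kernel version):

    # cat /sys/block/sdb/queue/scheduler
    noop anticipatory [deadline] cfq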

  • Disable Transparent Huge Pages (THP). RHEL 6.0 or higher enables THP by default. THP degrades Greenplum Database performance. One way to disable THP on RHEL 6.x is by adding the parameter transparent_hugepage=never to the kernel command in the file /boot/grub/grub.conf, the GRUB boot loader configuration file. This is an example kernel command from a grub.conf file. The command is on multiple lines for readability:
    kernel /vmlinuz-2.6.18-274.3.1.el5 ro root=LABEL=/
        elevator=deadline crashkernel=128M@16M  quiet console=tty1
        console=ttyS1,115200 panic=30 transparent_hugepage=never 
        initrd /initrd-2.6.18-274.3.1.el5.img
    On systems that use grub2 such as RHEL 7.x or CentOS 7.x, use the system utility grubby. This command adds the parameter when run as root.
    # grubby --update-kernel=ALL --args="transparent_hugepage=never"

    After adding the parameter, reboot the system.

    This cat command checks the state of THP. The output indicates that THP is disabled.
    $ cat /sys/kernel/mm/*transparent_hugepage/enabled
    always [never]

    For more information about Transparent Huge Pages or the grubby utility, see your operating system documentation. If the grubby command does not update the kernels, see the Note at the end of the section.

  • Disable IPC object removal for RHEL 7.2 or CentOS 7.2. The default systemd setting RemoveIPC=yes removes IPC connections when non-system user accounts log out. This causes the Greenplum Database utility gpinitsystem to fail with semaphore errors. Perform one of the following to avoid this issue.
    • When you add the gpadmin operating system user account to the master node in Creating the Greenplum Database Administrative User Account, create the user as a system account. You must also add the gpadmin user as a system account on the segment hosts, either manually or by using the gpseginstall command (described in the later installation step Installing and Configuring Greenplum on all Hosts).
      Note: When you run the gpseginstall utility as the root user to install Greenplum Database on host systems, the utility creates the gpadmin operating system user as a system account on the hosts.
    • Disable RemoveIPC. Set this parameter in /etc/systemd/logind.conf on the Greenplum Database host systems.
      RemoveIPC=no

      The setting takes effect after restarting the systemd-logind service or rebooting the system. To restart the service, run this command as the root user.

      service systemd-logind restart
  • Certain Greenplum Database management utilities, including gpexpand, gpinitsystem, and gpaddmirrors, use secure shell (SSH) connections between systems to perform their tasks. In large Greenplum Database deployments, cloud deployments, or deployments with a large number of segments per host, these utilities may exceed the host's maximum threshold for unauthenticated connections. When this occurs, you receive errors such as: ssh_exchange_identification: Connection closed by remote host.

    To increase this connection threshold for your Greenplum Database deployment, update the SSH MaxStartups configuration parameter in one of the /etc/ssh/sshd_config or /etc/sshd_config SSH daemon configuration files.

    If you specify MaxStartups using a single integer value, you identify the maximum number of concurrent unauthenticated connections. For example:
    MaxStartups 200
    If you specify MaxStartups using the "start:rate:full" syntax, you enable random early connection drop by the SSH daemon. start identifies the maximum number of unauthenticated SSH connection attempts allowed. Once start number of unauthenticated connection attempts is reached, the SSH daemon refuses rate percent of subsequent connection attempts. full identifies the maximum number of unauthenticated connection attempts after which all attempts are refused. For example:
    MaxStartups 10:30:200
    Restart the SSH daemon after you update MaxStartups. For example, on a CentOS 6 system, run the following command as the root user:
    # service sshd restart

    For detailed information about SSH configuration options, refer to the SSH documentation for your Linux distribution.
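
    Before restarting the daemon, you can have sshd validate the edited configuration file; this standard OpenSSH test mode reports syntax errors without affecting running sessions:

    # /usr/sbin/sshd -t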

  • On some SUSE Linux Enterprise Server platforms, the Greenplum Database utility gpssh fails with the error message out of pty devices. A workaround is to add Greenplum Database operating system users, for example gpadmin, to the tty group. On SUSE systems, membership in the tty group is required to run gpssh.
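
    For example, this command (run as root) adds the gpadmin user to the tty group; the change applies to new login sessions:

    # usermod -a -G tty gpadmin
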
Note: If the grubby command does not update the kernels of a RHEL 7.x or CentOS 7.x system, you can manually update all kernels on the system. For example, to add the parameter transparent_hugepage=never to all kernels on a system.
  1. Add the parameter to the GRUB_CMDLINE_LINUX line in the file /etc/default/grub.
    GRUB_TIMEOUT=5
    GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
    GRUB_DEFAULT=saved
    GRUB_DISABLE_SUBMENU=true
    GRUB_TERMINAL_OUTPUT="console"
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet transparent_hugepage=never"
    GRUB_DISABLE_RECOVERY="true"
  2. As root, run the grub2-mkconfig command to update the kernels.
    # grub2-mkconfig -o /boot/grub2/grub.cfg
  3. Reboot the system.

Creating the Greenplum Database Administrative User Account

You must create a dedicated operating system user account on the master node to run Greenplum Database. You administer Greenplum Database as this operating system user. This user account is named, by convention, gpadmin.

Note: If you are installing the Greenplum Database RPM distribution, create the gpadmin user on every host in the Greenplum Database cluster because the installer does not create the gpadmin user for you. See the note under Installing the Greenplum Database Software for more information.

You cannot run the Greenplum Database server as root.

The gpadmin user account must have permission to access the services and directories required to install and run Greenplum Database.

To create the gpadmin operating system user account, run the groupadd, useradd, and passwd commands as the root user.

Note: If you are installing Greenplum Database on RHEL 7.2 or CentOS 7.2 and chose to disable IPC object removal by creating the gpadmin user as a system account, provide both the -r and the -m <home_dir> options to the useradd command.
For example:
# groupadd gpadmin
# useradd gpadmin -g gpadmin
# passwd gpadmin
New password: <changeme>
Retype new password: <changeme>
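
If you are creating gpadmin as a system account to work around the RemoveIPC issue described earlier, the useradd command would include the -r and -m options noted above. A sketch, assuming the gpadmin group has already been created as shown:

# useradd -r -m -g gpadmin gpadmin
# passwd gpadmin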

Installing the Greenplum Database Software

Pivotal distributes the Greenplum Database software both as a downloadable RPM file and as a binary installer. You can use either distribution to install the software, but there are important differences between the two installation methods:
  • If you use the RPM distribution, install the RPM file on the master, standby master, and every segment host. You will need to create the gpadmin user on every host. (See Creating the Greenplum Database Administrative User Account.) After the RPM file is installed on every host, you must enable passwordless SSH access for the gpadmin user from each host to every other host.
  • If you use the binary installer, you can install the distribution on the master host only, and then use the Greenplum Database gpseginstall utility to copy the installation from the master host to all other hosts in the cluster. The gpseginstall utility creates the gpadmin user on each host, if it does not already exist, and enables passwordless SSH for the gpadmin user.
Warning: It is possible to install the RPM distribution on the master host, and then use the gpseginstall utility to copy the Greenplum Database installation directory to all other hosts. However, this is not recommended because gpseginstall does not install the RPM package on the other hosts, so you will be unable to use the OS package management utilities to remove or upgrade the Greenplum software on the standby master host or segment hosts.

If you do not have root access on the master host machine, run the binary installer as the gpadmin user and install the software into a directory in which you have write permission.

Installing the RPM Distribution

Perform these steps on the master host, standby master host, and every segment host in the Greenplum Database cluster.

  1. Log in as root.
  2. Download or copy the RPM distribution file to the host machine. The RPM distribution filename has the format greenplum-db-<version>-<platform>.rpm where <platform> is similar to rhel7-x86_64 (Red Hat 64-bit) or sles11-x86_64 (SUSE Linux 64-bit).
  3. Install the local RPM file:
    # rpm -ivh ./greenplum-db-<version>-<platform>.rpm
    Preparing...                ########################################### [100%]
       1:greenplum-db           ########################################### [100%]

    The RPM installation copies the Greenplum Database software into a version-specific directory, /usr/local/greenplum-db-<version>.

  4. Change the ownership and group of the installed files to gpadmin:
    # chown -R gpadmin /usr/local/greenplum*
    # chgrp -R gpadmin /usr/local/greenplum*
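
    To confirm that the package is registered with the RPM database on a host, you can query it; the output should show the installed greenplum-db package and version:

    # rpm -q greenplum-db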

Enabling Passwordless SSH

After the RPM has been installed on all hosts in the cluster, use the gpssh-exkeys utility to set up passwordless SSH for the gpadmin user.

  1. Log in to the master host as the gpadmin user.
  2. Source the path file in the Greenplum Database installation directory.
    $ source /usr/local/greenplum-db-<version>/greenplum_path.sh
  3. In the gpadmin home directory, create a file named hostfile_exkeys that has the machine configured host names and host addresses (interface names) for each host in your Greenplum system (master, standby master, and segment hosts). Make sure there are no blank lines or extra spaces. Check the /etc/hosts file on your systems for the correct host names to use for your environment. For example, if you have a master, standby master, and three segment hosts with two unbonded network interfaces per host, your file would look something like this:
    mdw
    mdw-1
    mdw-2
    smdw
    smdw-1
    smdw-2
    sdw1
    sdw1-1
    sdw1-2
    sdw2
    sdw2-1
    sdw2-2
    sdw3
    sdw3-1
    sdw3-2
  4. Run the gpssh-exkeys utility with your hostfile_exkeys file to enable passwordless SSH for the gpadmin user.
    $ gpssh-exkeys -f hostfile_exkeys
Note: You can run the gpssh-exkeys utility again as the root user if you want to enable passwordless SSH for root.

Follow the steps in Confirming Your Installation to verify that the Greenplum Database software is installed correctly.

Installing the Binary Distribution

  1. Log in as root on the machine that will become the Greenplum Database master host.

    If you do not have root access on the master host machine, run the binary installer as the gpadmin user and install the software into a directory in which you have write permission.

  2. Download or copy the binary installer distribution file to the master host machine. The binary installer distribution filename has the format greenplum-db-<version>-<platform>.zip where <platform> is similar to RHEL7-x86_64 (Red Hat 64-bit) or SuSE12-x86_64 (SUSE Linux 64-bit).
  3. Unzip the installer file:
    # unzip greenplum-db-<version>-<platform>.zip
  4. Launch the installer using bash:
    # /bin/bash greenplum-db-<version>-<platform>.bin
  5. The installer prompts you to accept the Greenplum Database license agreement. Type yes to accept the license agreement.
  6. The installer prompts you to provide an installation path. Press ENTER to accept the default install path (/usr/local/greenplum-db-<version>), or enter an absolute path to a custom install location. You must have write permission to the location you specify.
  7. The installer installs the Greenplum Database software and creates a greenplum-db symbolic link one directory level above the version-specific installation directory. The symbolic link is used to facilitate patch maintenance and upgrades between versions. The installed location is referred to as $GPHOME.
  8. If you installed as root, change the ownership and group of the installed files to gpadmin:
    # chown -R gpadmin /usr/local/greenplum*
    # chgrp -R gpadmin /usr/local/greenplum*
  9. To perform additional required system configuration tasks and to install Greenplum Database on other hosts, go to the next task Installing and Configuring Greenplum on all Hosts.

About Your Greenplum Database Installation

The Greenplum Database installation directory ($GPHOME) contains the following files and directories:

  • greenplum_path.sh — This file contains the environment variables for Greenplum Database. See Setting Greenplum Environment Variables.
  • bin — This directory contains the Greenplum Database management utilities. This directory also contains the PostgreSQL client and server programs, most of which are also used in Greenplum Database.
  • docs/cli_help — This directory contains help files for Greenplum Database command-line utilities.
  • docs/cli_help/gpconfigs — This directory contains sample gpinitsystem configuration files and host files that can be modified and used when installing and initializing a Greenplum Database system.
  • docs/javadoc — This directory contains javadocs for the gNet extension (gphdfs protocol). The jar files for the gNet extension are installed in the $GPHOME/lib/hadoop directory.
  • etc — Sample configuration file for OpenSSL and a sample configuration file to be used with the gpcheck management utility.
  • ext — Bundled programs (such as Python) used by some Greenplum Database utilities.
  • include — The C header files for Greenplum Database.
  • lib — Greenplum Database and PostgreSQL library files.
  • sbin — Supporting/Internal scripts and programs.
  • share — Shared files for Greenplum Database.

Installing and Configuring Greenplum on all Hosts

When run as root, gpseginstall copies the Greenplum Database installation from the current host and installs it on a list of specified hosts, creates the Greenplum operating system user account (typically named gpadmin), sets the account password (default is changeme), sets the ownership of the Greenplum Database installation directory, and exchanges ssh keys between all specified host address names (both as root and as the specified user account).

Note: If you are setting up a single node system, you can still use gpseginstall to perform the required system configuration tasks on the current host. In this case, the hostfile_exkeys should have only the current host name.

To install and configure Greenplum Database on all specified hosts

  1. Log in to the master host as root:
    $ su -
  2. Source the path file from your master host's Greenplum Database installation directory:
    # source /usr/local/greenplum-db/greenplum_path.sh
  3. Create a file called hostfile_exkeys that has the machine configured host names and host addresses (interface names) for each host in your Greenplum system (master, standby master and segments). Make sure there are no blank lines or extra spaces. For example, if you have a master, standby master and three segments with two unbonded network interfaces per host, your file would look something like this:
    mdw
    mdw-1
    mdw-2
    smdw
    smdw-1
    smdw-2
    sdw1
    sdw1-1
    sdw1-2
    sdw2
    sdw2-1
    sdw2-2
    sdw3
    sdw3-1
    sdw3-2

    Check the /etc/hosts file on your systems for the correct host names to use for your environment.

    The Greenplum Database segment host naming convention is sdwN where sdw is a prefix and N is an integer. For example, segment host names would be sdw1, sdw2 and so on. NIC bonding is recommended for hosts with multiple interfaces, but when the interfaces are not bonded, the convention is to append a dash (-) and number to the host name. For example, sdw1-1 and sdw1-2 are the two interface names for host sdw1.

  4. Run the gpseginstall utility referencing the hostfile_exkeys file you just created. This example runs the utility as root. The utility creates the Greenplum operating system user account gpadmin as a system account on all hosts and sets the account password to changeme for that user on all segment hosts.
    # gpseginstall -f hostfile_exkeys

    Use the -u and -p options to specify a different operating system account name and password. See gpseginstall for option information and running the utility as a non-root user.

Recommended security best practices:
  • Do not use the default password option for production environments.
  • Change the password immediately after installation.

Confirming Your Installation

To make sure the Greenplum software was installed and configured correctly, run the following confirmation steps from your Greenplum master host. If necessary, correct any problems before continuing on to the next task.

  1. Log in to the master host as gpadmin:
    $ su - gpadmin
  2. Source the path file from the Greenplum Database installation directory:
    $ source /usr/local/greenplum-db/greenplum_path.sh
  3. Use the gpssh utility to see if you can login to all hosts without a password prompt, and to confirm that the Greenplum software was installed on all hosts. Use the hostfile_exkeys file you used for installation. For example:
    $ gpssh -f hostfile_exkeys -e ls -l $GPHOME

    If the installation was successful, you can log in to all hosts without a password prompt. All hosts should show that they have the same contents in their installation directories, and that the directories are owned by the gpadmin user.

    If you are prompted for a password, run the following command to redo the ssh key exchange:

    $ gpssh-exkeys -f hostfile_exkeys

Installing Oracle Compatibility Functions

Optional. Many Oracle Compatibility SQL functions are available in Greenplum Database. These functions target PostgreSQL.

Before using any Oracle Compatibility Functions, you need to run the installation script $GPHOME/share/postgresql/contrib/orafunc.sql once for each database. For example, to install the functions in database testdb, use the command

$ psql -d testdb -f $GPHOME/share/postgresql/contrib/orafunc.sql

To uninstall Oracle Compatibility Functions, use the script:

$GPHOME/share/postgresql/contrib/uninstall_orafunc.sql
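
For example, to remove the functions from the testdb database, run the uninstall script with psql in the same way as the installation script:

$ psql -d testdb -f $GPHOME/share/postgresql/contrib/uninstall_orafunc.sql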
Note: The following functions are available by default and can be accessed without running the Oracle Compatibility installer: sinh, tanh, cosh and decode.

For more information about Greenplum's Oracle compatibility functions, see "Oracle Compatibility Functions" in the Greenplum Database Utility Guide.

Installing Optional Modules

dblink

The PostgreSQL dblink module provides simple connections to other Greenplum Database databases from installations with the same major version number residing either on the same database host, or on a remote host. Greenplum Database provides dblink support for database users to perform short ad hoc queries in other databases. It is not intended as a replacement for external tables or administrative tools such as gpcopy.

Before you can use dblink functions, run the installation script $GPHOME/share/postgresql/contrib/dblink.sql in each database where you want the ability to query other databases:
$ psql -d testdb -f $GPHOME/share/postgresql/contrib/dblink.sql

See dblink Functions for basic information about using dblink to query other databases. See dblink in the PostgreSQL documentation for more information about individual functions.
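
As a minimal sketch of dblink usage (assuming the dblink functions have been installed in testdb as shown above, and that a database named postgres is reachable with the current user's credentials), the following query lists the databases visible through the remote connection:

$ psql -d testdb -c "SELECT * FROM dblink('dbname=postgres', 'SELECT datname FROM pg_database') AS t(dbname name);"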

pgcrypto

Greenplum Database is installed with an optional module of encryption/decryption functions called pgcrypto. The pgcrypto functions allow database administrators to store certain columns of data in encrypted form. This adds an extra layer of protection for sensitive data, as data stored in Greenplum Database in encrypted form cannot be read by anyone who does not have the encryption key, nor can it be read directly from the disks.

Note: The pgcrypto functions run inside the database server, which means that all the data and passwords move between pgcrypto and the client application in clear-text. For optimal security, consider also using SSL connections between the client and the Greenplum master server.
Before you can use pgcrypto functions, run the installation script $GPHOME/share/postgresql/contrib/pgcrypto.sql in each database where you want to use the functions:
$ psql -d testdb -f $GPHOME/share/postgresql/contrib/pgcrypto.sql

See pgcrypto in the PostgreSQL documentation for more information about individual functions.
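
As a minimal sketch of pgcrypto usage (assuming the functions have been installed in testdb as shown above), the following query hashes a password with a Blowfish salt using the crypt and gen_salt functions:

$ psql -d testdb -c "SELECT crypt('mypassword', gen_salt('bf'));"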

Installing Greenplum Database Extensions

Optional. Use the Greenplum package manager (gppkg) to install Greenplum Database extensions such as PL/Java, PL/R, PostGIS, and MADlib, along with their dependencies, across an entire cluster. The package manager also integrates with existing scripts so that any packages are automatically installed on any new hosts introduced into the system following cluster expansion or segment host recovery.

See gppkg for more information, including usage.

Extension packages can be downloaded from the Greenplum Database page on Pivotal Network. The extension documentation in the Greenplum Database Reference Guide contains information about installing extension packages and using extensions.
Important: If you intend to use an extension package with Greenplum Database 5.0.0, you must install and use a Greenplum Database extension package (gppkg files and contrib modules) that is built for Greenplum Database 5.0.0. Any custom modules that were used with earlier versions must be rebuilt for use with Greenplum Database 5.0.0.
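
For example, after downloading an extension package, a typical installation command (the package file name here is only illustrative) looks like this when run as gpadmin on the master host:

$ gppkg -i postgis-<version>-<platform>.gppkg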

Installing and Configuring the Greenplum Platform Extension Framework (PXF)

Optional. If you do not plan to use PXF, no action is necessary.

If you plan to use PXF, refer to Installing and Configuring PXF for instructions.

Creating the Data Storage Areas

Every Greenplum Database master and segment instance has a designated storage area on disk that is called the data directory location. This is the file system location where the directories that store segment instance data will be created. The master host needs a data storage location for the master data directory. Each segment host needs a data directory storage location for its primary segments, and another for its mirror segments.

Creating a Data Storage Area on the Master Host

A data storage area is required on the Greenplum Database master host to store Greenplum Database system data such as catalog data and other system metadata.

To create the data directory location on the master

The data directory location on the master is different from those on the segments. The master does not store any user data; only the system catalog tables and system metadata are stored on the master instance, so you do not need to designate as much storage space as on the segments.

  1. Create or choose a directory that will serve as your master data storage area. This directory should have sufficient disk space for your data and be owned by the gpadmin user and group. For example, run the following commands as root:
    # mkdir /data/master
  2. Change ownership of this directory to the gpadmin user. For example:
    # chown gpadmin /data/master
  3. Using gpssh, create the master data directory location on your standby master as well. For example:
    # source /usr/local/greenplum-db-5.0.x.x/greenplum_path.sh 
    # gpssh -h smdw -e 'mkdir /data/master'
    # gpssh -h smdw -e 'chown gpadmin /data/master'

Creating Data Storage Areas on Segment Hosts

Data storage areas are required on the Greenplum Database segment hosts for primary segments. Separate storage areas are required for mirror segments.

To create the data directory locations on all segment hosts

  1. On the master host, log in as root:
    $ su -
  2. Create a file called hostfile_gpssh_segonly. This file should have only one machine configured host name for each segment host. For example, if you have three segment hosts:
    sdw1
    sdw2
    sdw3
  3. Using gpssh, create the primary and mirror data directory locations on all segment hosts at once using the hostfile_gpssh_segonly file you just created. For example:
    # source /usr/local/greenplum-db-5.0.x.x/greenplum_path.sh 
    # gpssh -f hostfile_gpssh_segonly -e 'mkdir /data/primary'
    # gpssh -f hostfile_gpssh_segonly -e 'mkdir /data/mirror'
    # gpssh -f hostfile_gpssh_segonly -e 'chown gpadmin /data/primary'
    # gpssh -f hostfile_gpssh_segonly -e 'chown gpadmin /data/mirror'
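
    As a quick check, you can verify that the directories exist on all segment hosts and are owned by gpadmin:

    # gpssh -f hostfile_gpssh_segonly -e 'ls -ld /data/primary /data/mirror'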

Synchronizing System Clocks

You should use NTP (Network Time Protocol) to synchronize the system clocks on all hosts that comprise your Greenplum Database system. See www.ntp.org for more information about NTP.

NTP on the segment hosts should be configured to use the master host as the primary time source, and the standby master as the secondary time source. On the master and standby master hosts, configure NTP to point to your preferred time server.

To configure NTP

  1. On the master host, log in as root and edit the /etc/ntp.conf file. Set the server parameter to point to your data center's NTP time server. For example (if 10.6.220.20 was the IP address of your data center's NTP server):
    server 10.6.220.20
  2. On each segment host, log in as root and edit the /etc/ntp.conf file. Set the first server parameter to point to the master host, and the second server parameter to point to the standby master host. For example:
    server mdw prefer
    server smdw
  3. On the standby master host, log in as root and edit the /etc/ntp.conf file. Set the first server parameter to point to the primary master host, and the second server parameter to point to your data center's NTP time server. For example:
    server mdw prefer
    server 10.6.220.20
  4. On the master host, use the NTP daemon to synchronize the system clocks on all Greenplum hosts. For example, using gpssh:
    # gpssh -f hostfile_gpssh_allhosts -v -e 'ntpd'
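
    To verify that the hosts are synchronizing, you can query each host's NTP peers (this assumes the ntpq utility, which is part of the standard NTP distribution, is installed on the hosts):

    # gpssh -f hostfile_gpssh_allhosts -v -e 'ntpq -p'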

Enabling iptables

On Linux systems, you can configure and enable the iptables firewall to work with Greenplum Database.

Note: Greenplum Database performance might be impacted when iptables is enabled. You should test the performance of your application with iptables enabled to ensure that performance is acceptable.

For more information about iptables see the iptables and firewall documentation for your operating system.

How to Enable iptables

  1. As gpadmin, the Greenplum Database administrator, run this command on the Greenplum Database master host to stop Greenplum Database:
    $ gpstop -a
  2. On the Greenplum Database hosts:
    1. Update the file /etc/sysconfig/iptables based on the Example iptables Rules.
    2. As root user, run these commands to enable iptables:
      # chkconfig iptables on
      # service iptables start
  3. As gpadmin, run this command on the Greenplum Database master host to start Greenplum Database:
    $ gpstart -a
Warning: After enabling iptables, this error in the /var/log/messages file indicates that the setting for the iptables table is too low and needs to be increased.
ip_conntrack: table full, dropping packet.

As root user, run this command to view the iptables table value:

# sysctl net.ipv4.netfilter.ip_conntrack_max

The following is the recommended setting to ensure that the Greenplum Database workload does not overflow the iptables table. The value might need to be adjusted for your hosts:

net.ipv4.netfilter.ip_conntrack_max=6553600

You can update /etc/sysctl.conf file with the value. For setting values in the file, see Setting the Greenplum Recommended OS Parameters.

To set the value until the next reboot, run this command as root.

# sysctl net.ipv4.netfilter.ip_conntrack_max=6553600

Example iptables Rules

When iptables is enabled, iptables manages the IP communication on the host system based on configuration settings (rules). The example rules are used to configure iptables for the Greenplum Database master host, standby master host, and segment hosts.

The two sets of rules account for the different types of communication Greenplum Database expects on the master (primary and standby) and segment hosts. The rules should be added to the /etc/sysconfig/iptables file of the Greenplum Database hosts. For Greenplum Database, iptables rules should allow the following communication:

  • For customer facing communication with the Greenplum Database master, allow at least postgres and 28080 (eth1 interface in the example).
  • For Greenplum Database system interconnect, allow communication using tcp, udp, and icmp protocols (eth4 and eth5 interfaces in the example).

    The network interfaces that you specify in the iptables settings are the interfaces for the Greenplum Database hosts that you list in the hostfile_gpinitsystem file. You specify the file when you run the gpinitsystem command to initialize a Greenplum Database system. See Initializing a Greenplum Database System for information about the hostfile_gpinitsystem file and the gpinitsystem command.

In the iptables file, each append rule command (lines starting with -A) is a single line.

The example rules should be adjusted for your configuration. For example:

  • The append command, the -A lines, and the connection parameter -i should match the network interfaces for your hosts.
  • The CIDR network mask information for the source parameter -s should match the IP addresses for your network.

Example Master and Standby Master iptables Rules

Example iptables rules with comments for the /etc/sysconfig/iptables file on the Greenplum Database master host and standby master host.

*filter
# The following 3 are default rules. If a packet passes through
# the rule set without matching any rule, these default policies apply.
# Drop all inbound packets by default.
# Drop all forwarded (routed) packets.
# Let anything outbound go through.
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
# Accept anything on the loopback interface.
-A INPUT -i lo -j ACCEPT
# If a connection has already been established allow the
# remote host packets for the connection to pass through.
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# These rules let all tcp and udp through on the standard
# interconnect IP addresses and on the interconnect interfaces.
# NOTE: gpsyncmaster uses random tcp ports in the range 1025 to 65535
# and Greenplum Database uses random udp ports in the range 1025 to 65535.
-A INPUT -i eth4 -p udp -s 192.0.2.0/22 -j ACCEPT
-A INPUT -i eth5 -p udp -s 198.51.100.0/22 -j ACCEPT
-A INPUT -i eth4 -p tcp -s 192.0.2.0/22 -j ACCEPT --syn -m state --state NEW
-A INPUT -i eth5 -p tcp -s 198.51.100.0/22 -j ACCEPT --syn -m state --state NEW
# Allow ssh on all networks (This rule can be more strict).
-A INPUT -p tcp --dport ssh -j ACCEPT --syn -m state --state NEW
# Allow Greenplum Database on all networks.
-A INPUT -p tcp --dport postgres -j ACCEPT --syn -m state --state NEW
# Allow Greenplum Command Center on the customer facing network.
-A INPUT -i eth1 -p tcp --dport 28080 -j ACCEPT --syn -m state --state NEW
# Allow ping and any other icmp traffic on the interconnect networks.
-A INPUT -i eth4 -p icmp -s 192.0.2.0/22 -j ACCEPT
-A INPUT -i eth5 -p icmp -s 198.51.100.0/22 -j ACCEPT
# Log an error if a packet passes through the rules to the default
# INPUT rule (a DROP).
-A INPUT -m limit --limit 5/min -j LOG --log-prefix "iptables denied: " --log-level 7
COMMIT

Example Segment Host iptables Rules

Example iptables rules for the /etc/sysconfig/iptables file on the Greenplum Database segment hosts. The rules for segment hosts are similar to the master rules with fewer interfaces and fewer udp and tcp services.

*filter
:INPUT DROP
:FORWARD DROP
:OUTPUT ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -i eth2 -p udp -s 192.0.2.0/22 -j ACCEPT
-A INPUT -i eth3 -p udp -s 198.51.100.0/22 -j ACCEPT
-A INPUT -i eth2 -p tcp -s 192.0.2.0/22 -j ACCEPT --syn -m state --state NEW
-A INPUT -i eth3 -p tcp -s 198.51.100.0/22 -j ACCEPT --syn -m state --state NEW
-A INPUT -i eth0 -p udp --dport snmp -s 203.0.113.0/21 -j ACCEPT
-A INPUT -i eth0 -p tcp --dport snmp -j ACCEPT --syn -m state --state NEW
-A INPUT -p tcp --dport ssh -j ACCEPT --syn -m state --state NEW
-A INPUT -i eth2 -p icmp -s 192.0.2.0/22 -j ACCEPT
-A INPUT -i eth3 -p icmp -s 198.51.100.0/22 -j ACCEPT
-A INPUT -i eth0 -p icmp --icmp-type echo-request -s 203.0.113.0/21 -j ACCEPT
-A INPUT -m limit --limit 5/min -j LOG --log-prefix "iptables denied: " --log-level 7
COMMIT

Next Steps

After you have configured the operating system environment and installed the Greenplum Database software on all of the hosts in the system, the next steps are: