Configuring Your Systems and Installing Greenplum
This chapter describes how to prepare your operating system environment for Greenplum, and install the Greenplum Database software binaries on all of the hosts that will comprise your Greenplum Database system. Perform the following tasks in order:
- Make sure your systems meet the System Requirements
- Setting the Greenplum Recommended OS Parameters
- (master only) Running the Greenplum Installer
- Installing and Configuring Greenplum on all Hosts
- (Optional) Installing Oracle Compatibility Functions
- (Optional) Installing Greenplum Database Extensions
- Creating the Data Storage Areas
- Synchronizing System Clocks
- Next Steps
Unless noted, these tasks should be performed for all hosts in your Greenplum Database array (master, standby master and segments).
You can install and configure Greenplum Database on virtual servers provided by the Amazon Elastic Compute Cloud (Amazon EC2) web service. For information about using Greenplum Database in an Amazon EC2 environment, see Amazon EC2 Configuration.
System Requirements

The following table lists minimum recommended specifications for servers intended to support Greenplum Database on Linux systems in a production environment. Greenplum also provides hardware build guides for its certified hardware platforms. It is recommended that you work with a Greenplum Systems Engineer to review your anticipated environment to ensure an appropriate hardware configuration for Greenplum Database.
|Item|Requirement|
|----|-----------|
|Operating System|SUSE Linux Enterprise Server 11 SP2; CentOS 5.0 or higher; Red Hat Enterprise Linux (RHEL) 5.0 or higher; Oracle Unbreakable Linux 5.5. Note: See the Greenplum Database Release Notes for current supported platform information.|
|Minimum CPU|Pentium Pro compatible (P3/Athlon and above)|
|Minimum Memory|16 GB RAM per server|
|Network Requirements|10 Gigabit Ethernet within the array; dedicated, non-blocking switch|
|Software and Utilities|bash shell; GNU sed (used by Greenplum Database gpinitsystem)|
Disabling SELinux and Firewall Software

The following information describes how to check the status of, and disable, SELinux and the iptables and firewalld firewall services.
- This command checks the status of SELinux when run as root:

# sestatus
SELinux status: disabled
You can disable SELinux by editing the /etc/selinux/config file. As root, change the value of the SELINUX parameter in the config file and reboot the system:
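SELINUX=disabled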
- This command checks the status of iptables when run as root:
# /sbin/chkconfig --list iptables
This is the output if iptables is disabled.
iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
One method of disabling iptables is to become root, run this command, and then reboot the system:
/sbin/chkconfig iptables off
This command checks the status of firewalld when run as root:
# systemctl status firewalld.service
This is the output if firewalld is disabled.
* firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
These commands disable firewalld when run as root:
# systemctl stop firewalld.service
# systemctl disable firewalld.service
Setting the Greenplum Recommended OS Parameters
Greenplum requires that certain Linux operating system (OS) parameters be set on all hosts in your Greenplum Database system (masters and segments).
In general, the following categories of system parameters need to be altered:
- Shared Memory - A Greenplum Database instance will not work unless the shared memory segment for your kernel is properly sized. Most default OS installations have the shared memory values set too low for Greenplum Database. On Linux systems, you must also disable the OOM (out of memory) killer. For information about Greenplum Database shared memory requirements, see the Greenplum Database server configuration parameter shared_buffers in the Greenplum Database Reference Guide.
- Network - On high-volume Greenplum Database systems, certain network-related tuning parameters must be set to optimize network connections made by the Greenplum interconnect.
- User Limits - User limits control the resources available to processes started by a user's shell. Greenplum Database requires a higher limit on the allowed number of file descriptors that a single process can have open. The default settings may cause some Greenplum Database queries to fail because they will run out of file descriptors needed to process the query.
Linux System Settings
- Edit the /etc/hosts file and make sure that it includes the host names and all interface address names for every machine participating in your Greenplum Database system.
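For example, a minimal sketch of /etc/hosts entries; the host names follow the naming convention used later in this chapter, and the addresses are hypothetical:

192.0.2.10 mdw mdw-1
192.0.2.11 smdw smdw-1
192.0.2.12 sdw1 sdw1-1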
- Set the following parameters in the /etc/sysctl.conf file and reboot:

kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ip_local_port_range = 10000 65535
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
vm.overcommit_memory = 2

Note: When vm.overcommit_memory is 2, you specify a value for vm.overcommit_ratio. For information about calculating the value for vm.overcommit_ratio, see the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide.
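The settings in /etc/sysctl.conf are read at boot. As a convenience, you can also load them into the running kernel with sysctl; this sketch does not replace the reboot called for above:

# sysctl -p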
To avoid port conflicts between Greenplum Database and other applications, when initializing Greenplum Database, do not specify Greenplum Database ports in the range specified by the operating system parameter net.ipv4.ip_local_port_range. For example, if net.ipv4.ip_local_port_range = 10000 65535, you could set the Greenplum Database base port numbers to these values.
PORT_BASE = 6000
MIRROR_PORT_BASE = 7000
REPLICATION_PORT_BASE = 8000
MIRROR_REPLICATION_PORT_BASE = 9000
For information about port ranges that are used by Greenplum Database, see gpinitsystem.
- Set the following parameters in the /etc/security/limits.conf file:

* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072
For Red Hat Enterprise Linux (RHEL) 6.x and 7.x and CentOS 6.x and 7.x, parameter values in the /etc/security/limits.d/NN-nproc.conf file override the values in the limits.conf file. If a parameter value is set in both conf files, ensure that the parameter is set properly in the NN-nproc.conf file. The Linux module pam_limits sets user limits by reading the values from the limits.conf file and then from the NN-nproc.conf file. For information about PAM and user limits, see the documentation on PAM and pam_limits.
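To verify that the limits are in effect, open a new shell session for the Greenplum user and check with ulimit; the values shown should match the limits.conf settings above:

$ ulimit -n
65536
$ ulimit -u
131072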
- Each disk device file should have a read-ahead (blockdev) value of 16384.
To verify the read-ahead value of a disk device:
# /sbin/blockdev --getra devname
For example:

# /sbin/blockdev --getra /dev/sdb
To set blockdev (read-ahead) on a device:
# /sbin/blockdev --setra bytes devname
For example:

# /sbin/blockdev --setra 16384 /dev/sdb
See the manual page (man) for the blockdev command for more information about using that command (man blockdev opens the man page).
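Note that a read-ahead value set with blockdev --setra does not persist across reboots. One common approach, shown here as a sketch rather than a required method, is to reapply the setting at boot, for example from /etc/rc.d/rc.local:

/sbin/blockdev --setra 16384 /dev/sdb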
- XFS is the preferred file system on Linux platforms for data storage.
The following XFS mount options are recommended:

rw,nodev,noatime,nobarrier,inode64

For RHEL 5.x, CentOS 5.x, and SUSE versions 11.3 and earlier, the allocsize=16m option is also recommended.
See the manual page (man) for the mount command for more information about using that command (man mount opens the man page).
The XFS options can also be set in the /etc/fstab file. This example entry from an fstab file of a CentOS 7 system specifies the XFS options.
/dev/data /data xfs nodev,noatime,nobarrier,inode64 0 0
This example entry from an fstab file includes the allocsize=16m option for RHEL 5.x, CentOS 5.x, and SUSE versions 11.3 and earlier.
/dev/data /data xfs nodev,noatime,nobarrier,inode64,allocsize=16m 0 0
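To apply the options to a file system without editing /etc/fstab, you can mount it manually. This sketch assumes the same hypothetical device and mount point as the fstab entries above:

# mount -t xfs -o rw,nodev,noatime,nobarrier,inode64 /dev/data /data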
- The Linux disk I/O scheduler for disk access supports different policies, such as CFQ, AS, and deadline. The deadline scheduler option is recommended. To specify a scheduler until the next system reboot, run the following:
# echo schedulername > /sys/block/devname/queue/scheduler
# echo deadline > /sys/block/sdb/queue/scheduler
You can specify the I/O scheduler at boot time with the elevator kernel parameter. Add the parameter elevator=deadline to the kernel command in the file /boot/grub/grub.conf, the GRUB boot loader configuration file. This is an example kernel command from a grub.conf file on RHEL 6.x or CentOS 6.x. The command is on multiple lines for readability.
kernel /vmlinuz-2.6.18-274.3.1.el5 ro root=LABEL=/ elevator=deadline crashkernel=128M@16M quiet console=tty1 console=ttyS1,115200 panic=30 transparent_hugepage=never
initrd /initrd-2.6.18-274.3.1.el5.img

To specify the I/O scheduler at boot time on systems that use grub2, such as RHEL 7.x or CentOS 7.x, use the system utility grubby. This command adds the parameter when run as root.
# grubby --update-kernel=ALL --args="elevator=deadline"
After adding the parameter, reboot the system.

This grubby command displays kernel parameter settings.
# grubby --info=ALL
For more information about the grubby utility, see your operating system documentation. If the grubby command does not update the kernels, see the Note at the end of the section.
- Disable Transparent Huge Pages (THP). RHEL 6.0 or higher enables THP by default. THP degrades Greenplum Database performance. One way to disable THP on RHEL 6.x is by adding the parameter transparent_hugepage=never to the kernel command in the file /boot/grub/grub.conf, the GRUB boot loader configuration file. This is an example kernel command from a grub.conf file. The command is on multiple lines for readability.

kernel /vmlinuz-2.6.18-274.3.1.el5 ro root=LABEL=/ elevator=deadline crashkernel=128M@16M quiet console=tty1 console=ttyS1,115200 panic=30 transparent_hugepage=never
initrd /initrd-2.6.18-274.3.1.el5.img

On systems that use grub2, such as RHEL 7.x or CentOS 7.x, use the system utility grubby. This command adds the parameter when run as root.
# grubby --update-kernel=ALL --args="transparent_hugepage=never"
After adding the parameter, reboot the system.

This cat command checks the state of THP. The output indicates that THP is disabled.

$ cat /sys/kernel/mm/*transparent_hugepage/enabled
always [never]

This grubby command displays kernel parameter settings.

# grubby --info=ALL
For more information about Transparent Huge Pages or the grubby utility, see your operating system documentation. If the grubby command does not update the kernels, see the Note at the end of the section.
- Disable IPC object removal for RHEL 7.2 or CentOS 7.2 releases or higher. The default
systemd setting RemoveIPC=yes removes IPC
connections when non-system users log out. This causes the Greenplum Database utility
gpinitsystem to fail with semaphore errors. Perform one of the
following to avoid this issue.
- Create the gpadmin user as a system account. For the useradd command, the -r option creates a user as a system user and the -m option creates a home directory for the user.
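For example, this command (a sketch based on the options described above) creates the gpadmin system account with a home directory:

# useradd -r -m gpadmin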
- Disable RemoveIPC. Set this parameter in /etc/systemd/logind.conf on the Greenplum Database host systems.
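For example, add this line to /etc/systemd/logind.conf:

RemoveIPC=no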
The setting takes effect after restarting the systemd-login service or rebooting the system. To restart the service, run this command as the root user.
service systemd-logind restart
Note: If the grubby command does not update the kernels, perform the following steps as a workaround.

- Add the parameter to the GRUB_CMDLINE_LINUX line in the file /etc/default/grub. For example:

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet transparent_hugepage=never"
GRUB_DISABLE_RECOVERY="true"
- As root, run the grub2-mkconfig command to update the GRUB configuration:

# grub2-mkconfig -o /boot/grub2/grub.cfg
- Reboot the system.
Running the Greenplum Installer
To configure your systems for Greenplum Database, you will need certain utilities found in $GPHOME/bin of your installation. Log in as root and run the Greenplum installer on the machine that will be your master host.
The gNet extension package contains the software for the gphdfs protocol. For Greenplum Database 4.3.1 and later releases, the extension is bundled with Greenplum Database. The files for gphdfs are installed in $GPHOME/lib/hadoop.
To install the Greenplum binaries on the master host
- Download or copy the installer file to the machine that will be the Greenplum Database master host. Installer files are available from Greenplum for Red Hat (32-bit and 64-bit) and SuSE Linux 64-bit platforms.
- Unzip the installer file, where PLATFORM is either RHEL5-i386 (Red Hat 32-bit), RHEL5-x86_64 (Red Hat 64-bit), or SuSE10-x86_64 (SuSE Linux 64-bit). For example:

# unzip greenplum-db-4.3.x.x-PLATFORM.zip
- Launch the installer using bash. For example:

# /bin/bash greenplum-db-4.3.x.x-PLATFORM.bin
- The installer will prompt you to accept the Greenplum Database license agreement. Type yes to accept the license agreement.
- The installer will prompt you to provide an installation path. Press ENTER to accept the default install path (/usr/local/greenplum-db-4.3.x.x), or enter an absolute path to an install location. You must have write permissions to the location you specify.
- Optional. The installer will prompt you to provide the path to a previous installation of Greenplum Database.
This installation step will migrate any Greenplum Database add-on modules (postgis, pgcrypto, etc.) from the previous installation path to the path to the version currently being installed. This step is optional and can be performed manually at any point after the installation using the gppkg utility with the -migrate option. See gppkg for details.
Press ENTER to skip this step.
- The installer will install the Greenplum software and create a greenplum-db symbolic link one directory level above your version-specific Greenplum installation directory. The symbolic link is used to facilitate patch maintenance and upgrades between versions. The installed location is referred to as $GPHOME.
- To perform additional required system configuration tasks and to install Greenplum Database on other hosts, go to the next task Installing and Configuring Greenplum on all Hosts.
About Your Greenplum Database Installation

Your Greenplum Database installation directory contains the following files and directories:
- greenplum_path.sh — This file contains the environment variables for Greenplum Database. See Setting Greenplum Environment Variables.
- GPDB-LICENSE.txt — Greenplum license agreement.
- bin — This directory contains the Greenplum Database management utilities. This directory also contains the PostgreSQL client and server programs, most of which are also used in Greenplum Database.
- demo — This directory contains the Greenplum demonstration programs.
- docs — The Greenplum Database documentation (PDF files).
- etc — Sample configuration file for OpenSSL.
- ext — Bundled programs (such as Python) used by some Greenplum Database utilities.
- include — The C header files for Greenplum Database.
- lib — Greenplum Database and PostgreSQL library files.
- sbin — Supporting/Internal scripts and programs.
- share — Shared files for Greenplum Database.
Installing and Configuring Greenplum on all Hosts
When run as root, gpseginstall copies the Greenplum Database installation from the current host and installs it on a list of specified hosts, creates the Greenplum system user (gpadmin), sets the system user's password (default is changeme), sets the ownership of the Greenplum Database installation directory, and exchanges ssh keys between all specified host address names (both as root and as the specified system user).
When a Greenplum Database system is first initialized, the system contains one predefined superuser role (also referred to as the system user), gpadmin. This is the user who owns and administers the Greenplum Database.
To install and configure Greenplum Database on all specified hosts
- Log in to the master host as root:

$ su -
- Source the path file from your master host's Greenplum Database installation directory:

# source /usr/local/greenplum-db/greenplum_path.sh
- Create a file called hostfile_exkeys that has the machine configured host names and host addresses (interface names) for each host in your Greenplum system (master, standby master and segments). Make sure there are no blank lines or extra spaces. For example, if you have a master, standby master and three segments with two network interfaces per host, your file would look something like this:

mdw
mdw-1
mdw-2
smdw
smdw-1
smdw-2
sdw1
sdw1-1
sdw1-2
sdw2
sdw2-1
sdw2-2
sdw3
sdw3-1
sdw3-2
Check the /etc/hosts file on your systems for the correct host names to use for your environment.
The Greenplum Database segment host naming convention is sdwN where sdw is a prefix and N is an integer. For example, on a Greenplum Database DCA system, segment host names would be sdw1, sdw2 and so on. For hosts with multiple interfaces, the convention is to append a dash (-) and number to the host name. For example, sdw1-1 and sdw1-2 are the two interface names for host sdw1.
- Run the gpseginstall utility referencing the hostfile_exkeys file you just created. Use the -u and -p options to create the Greenplum system user (gpadmin) on all hosts and set the password for that user on all hosts. For example:
# gpseginstall -f hostfile_exkeys -u gpadmin -p changeme
Recommended security best practices:
- Do not use the default password option for production environments.
- Change the password immediately after installation.
Confirming Your Installation
To make sure the Greenplum software was installed and configured correctly, run the following confirmation steps from your Greenplum master host. If necessary, correct any problems before continuing on to the next task.
- Log in to the master host as gpadmin:
$ su - gpadmin
- Source the path file from the Greenplum Database installation directory:
# source /usr/local/greenplum-db/greenplum_path.sh
- Use the gpssh utility to see if you can log in to all hosts without a password prompt, and to confirm that the Greenplum software was installed on all hosts. Use the hostfile_exkeys file you used for installation. For example:
$ gpssh -f hostfile_exkeys -e ls -l $GPHOME
If the installation was successful, you should be able to log in to all hosts without a password prompt. All hosts should show that they have the same contents in their installation directories, and that the directories are owned by the gpadmin user.
If you are prompted for a password, run the following command to redo the ssh key exchange:
$ gpssh-exkeys -f hostfile_exkeys
Installing Oracle Compatibility Functions
Optional. Many Oracle Compatibility SQL functions are available in Greenplum Database. These functions target PostgreSQL.
Before using any Oracle Compatibility Functions, you need to run the installation script $GPHOME/share/postgresql/contrib/orafunc.sql once for each database. For example, to install the functions in database testdb, use the command
$ psql -d testdb -f $GPHOME/share/postgresql/contrib/orafunc.sql
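To verify the installation, you can call one of the installed functions. This sketch assumes add_months is among the Oracle Compatibility Functions installed by the script:

$ psql -d testdb -c "SELECT add_months(date '2016-01-31', 1);"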
To uninstall the Oracle Compatibility Functions, use the following script:

$GPHOME/share/postgresql/contrib/uninstall_orafunc.sql
For more information about Greenplum's Oracle compatibility functions, see "Oracle Compatibility Functions" in the Greenplum Database Utility Guide.
Installing Greenplum Database Extensions
Optional. Use the Greenplum package manager (gppkg) to install Greenplum Database extensions such as pgcrypto, PL/R, PL/Java, PL/Perl, and PostGIS, along with their dependencies, across an entire cluster. The package manager also integrates with existing scripts so that any packages are automatically installed on any new hosts introduced into the system following cluster expansion or segment host recovery.
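For example, this command (with a hypothetical package file name) installs an extension package and its dependencies across the cluster:

$ gppkg -i pgcrypto-1.2-rhel5-x86_64.gppkg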
See gppkg for more information, including usage.
Creating the Data Storage Areas
Every Greenplum Database master and segment instance has a designated storage area on disk that is called the data directory location. This is the file system location where the directories that store segment instance data will be created. The master host needs a data storage location for the master data directory. Each segment host needs a data directory storage location for its primary segments, and another for its mirror segments.
Creating a Data Storage Area on the Master Host
A data storage area is required on the Greenplum Database master host to store Greenplum Database system data such as catalog data and other system metadata.
To create the data directory location on the master
The data directory location on the master is different from those on the segments. The master does not store any user data; only the system catalog tables and system metadata are stored on the master instance, so you do not need to designate as much storage space as on the segments.
- Create or choose a directory that will serve as your master data storage area. This directory should have sufficient disk space for your data and be owned by the gpadmin user and group. For example, run the following commands as root:
# mkdir /data/master
- Change ownership of this directory to the gpadmin user. For example:
# chown gpadmin /data/master
- Using gpssh, create the master data directory location on your standby master as well. For example:

# source /usr/local/greenplum-db-4.3.x.x/greenplum_path.sh
# gpssh -h smdw -e 'mkdir /data/master'
# gpssh -h smdw -e 'chown gpadmin /data/master'
Creating Data Storage Areas on Segment Hosts
Data storage areas are required on the Greenplum Database segment hosts for primary segments. Separate storage areas are required for mirror segments.
To create the data directory locations on all segment hosts
- On the master host, log in as root.
- Create a file called hostfile_gpssh_segonly. This file should have only one machine configured host name for each segment host. For example, if you have three segment hosts:

sdw1
sdw2
sdw3
- Using gpssh, create the primary and mirror data directory locations on all segment hosts at once using the hostfile_gpssh_segonly file you just created. For example:

# source /usr/local/greenplum-db-4.3.x.x/greenplum_path.sh
# gpssh -f hostfile_gpssh_segonly -e 'mkdir /data/primary'
# gpssh -f hostfile_gpssh_segonly -e 'mkdir /data/mirror'
# gpssh -f hostfile_gpssh_segonly -e 'chown gpadmin /data/primary'
# gpssh -f hostfile_gpssh_segonly -e 'chown gpadmin /data/mirror'
Synchronizing System Clocks
You should use NTP (Network Time Protocol) to synchronize the system clocks on all hosts that comprise your Greenplum Database system. See www.ntp.org for more information about NTP.
NTP on the segment hosts should be configured to use the master host as the primary time source, and the standby master as the secondary time source. On the master and standby master hosts, configure NTP to point to your preferred time server.
To configure NTP
- On the master host, log in as root and edit the /etc/ntp.conf file. Set the server parameter to point to your data center's NTP time server. For example (if 10.6.220.20 was the IP address of your data center's NTP server):

server 10.6.220.20
- On each segment host, log in as root and edit the /etc/ntp.conf file. Set the first server parameter to point to the master host, and the second server parameter to point to the standby master host. For example:

server mdw prefer
server smdw
- On the standby master host, log in as root and edit the /etc/ntp.conf file. Set the first server parameter to point to the primary master host, and the second server parameter to point to your data center's NTP time server. For example:

server mdw prefer
server 10.6.220.20
- On the master host, use the NTP daemon to synchronize the system clocks on all Greenplum hosts. For example, using gpssh:
# gpssh -f hostfile_gpssh_allhosts -v -e 'ntpd'
Enabling iptables

On Linux systems, you can configure and enable the iptables firewall to work with Greenplum Database.
For more information about iptables see the iptables and firewall documentation for your operating system.
How to Enable iptables
- As gpadmin, the Greenplum Database administrator, run this command on the Greenplum Database master host to stop Greenplum Database:
$ gpstop -a
- On the Greenplum Database hosts:
- Update the file /etc/sysconfig/iptables based on the Example iptables Rules.
- As root user, run these commands to enable iptables:

# chkconfig iptables on
# service iptables start
- As gpadmin, run this command on the Greenplum Database master host to start Greenplum Database:
$ gpstart -a
When iptables is running under a high connection load, the connection tracking table can overflow, causing packets to be dropped and this error to appear in the system log:

ip_conntrack: table full, dropping packet.
As root user, run this command to view the iptables table value:
# sysctl net.ipv4.netfilter.ip_conntrack_max
The following is the recommended setting to ensure that the Greenplum Database workload does not overflow the iptables table. The value might need to be adjusted for your hosts:

net.ipv4.netfilter.ip_conntrack_max=6553600
You can update /etc/sysctl.conf file with the value. For setting values in the file, see Setting the Greenplum Recommended OS Parameters.
To set the value until the next reboot, run this command as root.
# sysctl net.ipv4.netfilter.ip_conntrack_max=6553600
Example iptables Rules
When iptables is enabled, iptables manages the IP communication on the host system based on configuration settings (rules). The example rules are used to configure iptables for Greenplum Database master host, standby master host, and segment hosts.
The two sets of rules account for the different types of communication Greenplum Database expects on the master (primary and standby) and segment hosts. The rules should be added to the /etc/sysconfig/iptables file of the Greenplum Database hosts. For Greenplum Database, iptables rules should allow the following communication:
- For customer-facing communication with the Greenplum Database master, allow at least the postgres port and port 28080 (the eth1 interface in the example).
- For the Greenplum Database system interconnect, allow communication using tcp, udp, and icmp protocols (the eth4 and eth5 interfaces in the example).
The network interfaces that you specify in the iptables settings are the interfaces for the Greenplum Database hosts that you list in the hostfile_gpinitsystem file. You specify the file when you run the gpinitsystem command to initialize a Greenplum Database system. See Initializing a Greenplum Database System for information about the hostfile_gpinitsystem file and the gpinitsystem command.
- For the administration network on a Greenplum DCA, allow communication using ssh, snmp, ntp, and icmp protocols. (eth0 interface in the example).
In the iptables file, each append rule command (lines starting with -A) is a single line.
The example rules should be adjusted for your configuration. For example:
- The append command, the -A lines, and the connection parameter -i should match the connectors for your hosts.
- The CIDR network mask information for the source parameter -s should match the IP addresses for your network.
Example Master and Standby Master iptables Rules
Example iptables rules with comments for the /etc/sysconfig/iptables file on the Greenplum Database master host and standby master host.
*filter
# Following 3 are default rules. If a packet passes through
# the rule set, it gets these rules.
# Drop all inbound packets by default.
# Drop all forwarded (routed) packets.
# Let anything outbound go through.
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
# Accept anything on the loopback interface.
-A INPUT -i lo -j ACCEPT
# If a connection has already been established allow the
# remote host packets for the connection to pass through.
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# These rules let all tcp and udp through on the standard
# interconnect IP addresses and on the interconnect interfaces.
# NOTE: gpsyncmaster uses random tcp ports in the range 1025 to 65535
# and Greenplum Database uses random udp ports in the range 1025 to 65535.
-A INPUT -i eth4 -p udp -s 192.0.2.0/22 -j ACCEPT
-A INPUT -i eth5 -p udp -s 198.51.100.0/22 -j ACCEPT
-A INPUT -i eth4 -p tcp -s 192.0.2.0/22 -j ACCEPT --syn -m state --state NEW
-A INPUT -i eth5 -p tcp -s 198.51.100.0/22 -j ACCEPT --syn -m state --state NEW
# Allow snmp connections on the admin network on Greenplum DCA.
-A INPUT -i eth0 -p udp --dport snmp -s 203.0.113.0/21 -j ACCEPT
-A INPUT -i eth0 -p tcp --dport snmp -s 203.0.113.0/21 -j ACCEPT --syn -m state --state NEW
# Allow udp/tcp ntp connections on the admin network on Greenplum DCA.
-A INPUT -i eth0 -p udp --dport ntp -s 203.0.113.0/21 -j ACCEPT
-A INPUT -i eth0 -p tcp --dport ntp -s 203.0.113.0/21 -j ACCEPT --syn -m state --state NEW
# Allow ssh on all networks (This rule can be more strict).
-A INPUT -p tcp --dport ssh -j ACCEPT --syn -m state --state NEW
# Allow Greenplum Database on all networks.
-A INPUT -p tcp --dport postgres -j ACCEPT --syn -m state --state NEW
# Allow Greenplum Command Center on the customer facing network.
-A INPUT -i eth1 -p tcp --dport 28080 -j ACCEPT --syn -m state --state NEW
# Allow ping and any other icmp traffic on the interconnect networks.
-A INPUT -i eth4 -p icmp -s 192.0.2.0/22 -j ACCEPT
-A INPUT -i eth5 -p icmp -s 198.51.100.0/22 -j ACCEPT
# Allow ping only on the admin network on Greenplum DCA.
-A INPUT -i eth0 -p icmp --icmp-type echo-request -s 203.0.113.0/21 -j ACCEPT
# Log an error if a packet passes through the rules to the default
# INPUT rule (a DROP).
-A INPUT -m limit --limit 5/min -j LOG --log-prefix "iptables denied: " --log-level 7
COMMIT
Example Segment Host iptables Rules
Example iptables rules for the /etc/sysconfig/iptables file on the Greenplum Database segment hosts. The rules for segment hosts are similar to the master rules with fewer interfaces and fewer udp and tcp services.
*filter
:INPUT DROP
:FORWARD DROP
:OUTPUT ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -i eth2 -p udp -s 192.0.2.0/22 -j ACCEPT
-A INPUT -i eth3 -p udp -s 198.51.100.0/22 -j ACCEPT
-A INPUT -i eth2 -p tcp -s 192.0.2.0/22 -j ACCEPT --syn -m state --state NEW
-A INPUT -i eth3 -p tcp -s 198.51.100.0/22 -j ACCEPT --syn -m state --state NEW
-A INPUT -i eth0 -p udp --dport snmp -s 203.0.113.0/21 -j ACCEPT
-A INPUT -i eth0 -p tcp --dport snmp -j ACCEPT --syn -m state --state NEW
-A INPUT -p tcp --dport ssh -j ACCEPT --syn -m state --state NEW
-A INPUT -i eth2 -p icmp -s 192.0.2.0/22 -j ACCEPT
-A INPUT -i eth3 -p icmp -s 198.51.100.0/22 -j ACCEPT
-A INPUT -i eth0 -p icmp --icmp-type echo-request -s 203.0.113.0/21 -j ACCEPT
-A INPUT -m limit --limit 5/min -j LOG --log-prefix "iptables denied: " --log-level 7
COMMIT
Amazon EC2 Configuration (Amazon Web Services)
You can install and configure Greenplum Database on virtual servers provided by the Amazon Elastic Compute Cloud (Amazon EC2) web service. Amazon EC2 is a service provided by Amazon Web Services (AWS). The following overview information describes how to install Greenplum Database in an Amazon EC2 environment.
About Amazon EC2
You can use Amazon EC2 to launch as many virtual servers as you need, configure security and networking, and manage storage. An EC2 instance is a virtual server in the AWS cloud virtual computing environment.
EC2 instances are managed by AWS. AWS isolates your EC2 instances from other users in a virtual private cloud (VPC) and lets you control access to the instances. You can configure instance features such as operating system, network connectivity (network ports and protocols, IP address access), access to the Internet, and size and type of disk storage.
When you launch an instance, you use a preconfigured template for your instance, known as an Amazon Machine Image (AMI). The AMI packages the bits you need for your server (including the operating system and additional software). You can use images supplied by Amazon or use customized images. You launch instances in an Availability Zone of an AWS region. An Availability Zone is a distinct location within a region that is engineered to be insulated from failures in other Availability Zones.
For information about Amazon EC2, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
Launching EC2 Instances with the EC2 Console
With the Amazon EC2 Console, you can launch, configure, start, stop, and terminate (delete) virtual servers. When you launch instances, you select these features.
- Amazon Machine Image (AMI)
- An AMI is a template that contains the software configuration (operating system, application server, and applications).
- For Greenplum Database - Select an AMI that runs a supported operating system. See the Greenplum Database Release Notes for the release that you are installing.
Note: To create and launch a customized AMI, see About Amazon Machine Image (AMI).
- EC2 Instance Type
- A predefined set of performance characteristics. Instance types comprise varying combinations of CPU, memory, default storage, and networking capacity. You can modify storage options when you add storage.
- For Greenplum Database - The instance type must be an EBS-Optimized instance type when using Amazon EBS storage for Greenplum Database. See Configure storage for information about Greenplum Database storage requirements. For information about EBS-Optimized instances, see the Amazon documentation about EBS-Optimized instances.
- For sufficient network performance, the instance type must also support EC2 enhanced networking. For information about EC2 enhanced networking, see the Amazon documentation about Enhanced Networking on Linux Instances.
- The instances should be in a single VPC and subnet. Instances are always assigned a VPC internal IP address and can be assigned a public IP address for external and Internet access.
- The internal IP address is used for Greenplum Database communication between hosts. You can also use the internal IP address to access an instance from another instance within the EC2 VPC. For information about configuring launched instances to communicate with each other, see Working with EC2 Instances.
- A public IP address for the instance and an Internet gateway configured for the EC2 VPC are required for accessing the instance from an external source and for the instance to access the Internet. Internet access is required when installing Linux packages. When you launch a set of instances, you can enable or disable the automatic assignment of public IP addresses when the instances are started.
- If automatic assignment of public IP addresses is enabled, instances are always assigned a public IP address when the instance starts. If automatic assignment of public IP addresses is disabled, you can allocate an EC2 Elastic IP address and temporarily associate it with an instance in order to connect to and configure the instance.
- To control whether a public IP is assigned when launching an instance, see the Amazon documentation about Subnet Public IP Addressing.
- EC2 Instance Details
- Information about starting, running, and stopping EC2 instances, such as number of instances of the same AMI, network information, and EC2 VPC and subnet membership.
- Configure storage
- Adjust and add storage. For example, you can change the size of the root volume and add volumes.
- For Greenplum Database - Greenplum Database supports either EC2 instance store or Amazon EBS storage in a production environment.
- EC2 instance store provides temporary block-level storage. This storage is located on disks that are physically attached to the host computer. With instance store, powering off the instance causes data loss. Soft reboots preserve instance store data. However, EC2 instance store can provide higher and more consistent I/O performance.
EBS storage provides block-level storage volumes with long-term persistence. For EBS storage, the storage must be a RAID of Amazon EBS volumes mounted with the XFS file system for it to be a supported configuration. All other file systems are explicitly not supported by Pivotal.
There are several classes of EBS. For Greenplum Database, select the EBS volume type gp2 or io1. See the Amazon documentation about Block Device Mapping.
For more information about the Amazon storage types, see Notes.
- Create Tag
- An optional label that consists of a case-sensitive key-value pair used for organizing and searching a large number of EC2 resources.
- Security Group
- A set of firewall rules that control the network traffic for instances.
- For external access to an instance with ssh, create a rule that enables ssh for inbound network traffic.
Working with EC2 Instances
After the EC2 instances have started, you connect to and configure the instances. The Instances page of the EC2 Console lists the running instances and network information. If the instance does not have a public IP address, you can create an Elastic IP and associate it with the instance. See About Amazon Elastic IP Addresses.
To access EC2 instances, AWS uses public-key cryptography to secure the login information for your instance. A Linux instance has no password; you use a key pair to log in to your instance securely. You specify the name of the key pair when you launch your instance, then provide the private key when you log in using SSH. See the Amazon documentation about EC2 Key Pairs.
A key pair consists of a public key that AWS stores, and a private key file that you store. Together, they allow you to connect to your instance securely.
This example logs in to an EC2 instance from an external location with the private key file my-test.pem as the user ec2-user. The user ec2-user is the default user for some Linux AMI templates. This example assumes that the instance is configured with a public IP address 192.0.2.82 and that the pem file is the private key file that is used to access the instance.

ssh -i my-test.pem ec2-user@192.0.2.82
These commands copy the private key file to the .ssh directory in the user home directory as id_rsa and set the file permissions required by ssh:

cp my-test.pem ~/.ssh/id_rsa
chmod 400 ~/.ssh/id_rsa
This command copies the key file to another Greenplum Database host:

scp ~/.ssh/id_rsa firstname.lastname@example.org:~/.ssh/id_rsa
This example logs in to an EC2 instance using the installed id_rsa key file.
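For example (a sketch; the host name sdw1 follows the segment host naming convention used earlier and is an assumption):

$ ssh sdw1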
After the key file is installed on all Greenplum Database hosts, you can use Greenplum Database utilities such as gpseginstall, gpssh, and gpscp that access multiple Greenplum Database hosts.
Before installing Greenplum Database, you configure the EC2 instances as you would local host server machines. Configure the host operating system, configure host network information (for example, update the /etc/hosts file), set operating system parameters, and install operating system packages. For information about how to prepare your operating system environment for Greenplum Database, see Configuring Your Systems and Installing Greenplum.
For example, these commands install operating system packages used during Greenplum Database installation:

sudo yum install -y sed
sudo yum install -y unzip
sudo yum install -y vim
This example copies a Greenplum Database installer file to an instance; the version number and destination address are examples:

scp greenplum-db-4.3.x.x-build-1-RHEL5-x86_64.zip ec2-user@192.0.2.82:~/.
This example logs in to the instance, unzips the installer file, and runs the installer; the version number and address are examples:

ssh ec2-user@192.0.2.82
unzip greenplum-db-4.3.x.x-build-1-RHEL5-x86_64.zip
./greenplum-db-4.3.x.x-build-1-RHEL5-x86_64.bin
This example runs the gpseginstall utility as the user ec2-user with a host file named my-hosts that lists the Greenplum Database hosts:

gpseginstall -u ec2-user -f my-hosts
About Amazon Machine Image (AMI)
An Amazon Machine Image (AMI) is a template that contains a software configuration (for example, an operating system, an application server, and applications). From an AMI, you launch an instance, which is a copy of the AMI running as a virtual server in the cloud. You can launch multiple instances of an AMI.
After you launch an instance, it acts like a traditional host, and you can interact with it as you would any computer. You have complete control of your instances; you can use sudo to run commands that require root privileges.
You can create a customized Amazon EBS-backed Linux AMI from an instance that you've launched from an existing Amazon EBS-backed Linux AMI. After you've customized the instance to suit your needs, create and register a new AMI, which you can use to launch new instances with these customizations.
For information about AMI, see the Amazon documentation about AMIs.
About Amazon Elastic IP Addresses
An EC2 Elastic IP address is a public IP address that you can allocate (create) for your account. You can associate it with and disassociate it from instances as you require; it remains allocated to your account until you choose to release it.
Your default VPC is configured with an Internet gateway. When you allocate an EC2 Elastic IP address, AWS configures the VPC to allow internet access to the IP address using the gateway.
To enable an instance in your VPC to communicate with the Internet, it must have a public IP address or an EC2 Elastic IP address that's associated with a private IP address on your instance.
To ensure that your instances can communicate with the Internet, you must also attach an Internet gateway to your EC2 VPC. For information about VPC Internet Gateways, see the Amazon documentation about Internet gateways.
For information about EC2 Elastic IP addresses and how to use them, see the Amazon documentation about Elastic IP Addresses.
Notes

- The Greenplum Database utility gpssh-exkeys is not used with Greenplum Database hosts that are EC2 instances. You must copy the private key file to the .ssh directory in the user home directory for each instance.
- When you use Amazon EBS storage for Greenplum Database storage, the storage should be a RAID of Amazon EBS volumes mounted with the XFS file system for it to be a supported configuration.
For information about EBS storage, see the Amazon documentation about Amazon EBS. Also, see the Amazon EC2 documentation for configuring the Amazon EBS volumes and managing storage and file systems used with EC2 instances.
- For an EC2 instance with instance store, the virtual devices for instance store volumes are ephemeralN (N is between 0 and 23). On an instance running CentOS, the instance store block device names appear as /dev/xvd* devices (for example, /dev/xvdb).
Two examples of EC2 instance types that were configured with instance store and that showed acceptable performance are the d2.8xlarge instance type configured with four raid0 volumes of 6 disks each, and the i2.8xlarge instance type configured with two raid0 volumes of 4 disks.
For information about EC2 instance store, see the Amazon documentation about EC2 Instance Store.
- These are default ports in a Greenplum Database environment. These ports need to be open in the security group to allow access from a source external to a VPC.
| Port | Used by this application |
|------|--------------------------|
| 22 | ssh - connect to host with ssh |
| 5432 | Greenplum Database (master) |
| 28080 | Greenplum Command Center |
- For a non-default VPC, you can configure the VPC with an internet gateway and allocate an Elastic IP address for the VPC. AWS will automatically configure the Elastic IP for internet access. For information about EC2 internet gateways, see the Amazon documentation about Internet Gateways.
- A placement group is a logical grouping of instances within a single Availability Zone. Using placement groups enables applications to participate in a low-latency, 10 Gbps network. Placement groups are recommended for applications that benefit from low network latency, high network throughput, or both.

Placement groups provide the ability for EC2 instances to be separated from other instances. Configuring instances to be in different placement groups can improve performance but might create a configuration where an instance in a placement group cannot be replaced.
See the Amazon documentation about Placement Groups.
- Amazon EC2 provides enhanced networking capabilities using single root I/O virtualization (SR-IOV) on some instance types. Enabling enhanced networking on your instance results in higher performance (packets per second), lower latency, and lower jitter.

To enable enhanced networking on your RHEL or CentOS instance, you must ensure that the kernel has the ixgbevf module version 2.14.2 or higher installed and that the sriovNetSupport attribute is set.
For information about EC2 enhanced networking, see the Amazon documentation about Enhanced Networking on Linux Instances.
References

References to AWS and EC2 features and related information.
- AWS - https://aws.amazon.com/.
- Connecting to an EC2 instance - http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html.
- Amazon VPC - http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Introduction.html.
- Amazon EC2 best practices - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-best-practices.html
- Amazon EC2 and VPC command line interface (CLI) - http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/Welcome.html.
- Amazon S3 (secure, durable, scalable object storage) - https://aws.amazon.com/s3/.
- AWS CloudFormation - https://aws.amazon.com/cloudformation/.
With AWS CloudFormation, you can create templates to simplify provisioning and management of related AWS resources.