Greenplum PL/Python Language Extension
Greenplum PL/Python Language Extension
This section contains an overview of the Greenplum Database PL/Python Language.
About Greenplum PL/Python
PL/Python is a loadable procedural language. With the Greenplum Database PL/Python extension, you can write a Greenplum Database user-defined functions in Python that take advantage of Python features and modules to quickly build robust database applications.
The Greenplum Database PL/Python extension is installed by default with Greenplum Database. Greenplum Database installs a version of Python and PL/Python. This is location of the Python installation that Greenplum Database uses:
Greenplum Database PL/Python Limitations
- Greenplum Database does not support PL/Python triggers.
- PL/Python is available only as a Greenplum Database untrusted language.
Enabling and Removing PL/Python support
The PL/Python language is installed with Greenplum Database. To create and run a PL\Python user-defined function (UDF) in a database, you must register the PL\Python language with the database.
Enabling PL/Python Support
For each database that requires its use, register the PL/Python language with the SQL command CREATE LANGUAGE or the Greenplum Database utility createlang. For example, running this command as the gpadmin system user registers PL/Python for the database testdb:
$ createlang plpythonu -d testdb
PL/Python is registered as an untrusted language.
Removing PL/Python Support
For a database that no long requires the PL/Python language, remove support for PL/Python with the SQL command DROP LANGUAGE or the Greenplum Database droplang utility. For example, running this command run as the gpadmin system user removes support for PL/Python from the database testdb:
$ droplang plpythonu -d testdb
When you remove support for PL/Python, the PL/Python user-defined functions that you created in the database will no longer work.
Developing Functions with PL/Python
The body of a PL/Python user-defined function is a Python script. When the function is called, its arguments are passed as elements of the array args. Named arguments are also passed as ordinary variables to the Python script. The result is returned from the PL/Python function with return statement, or yield statement in case of a result-set statement.
The Greenplum Database PL/Python language module imports the Python module plpy. The module plpy implements these functions:
- Functions to execute SQL queries and prepare execution plans for queries.
- Functions to manage errors and messages.
Executing and Preparing SQL Queries
The PL/Python plpy module provides two Python functions to execute an SQL query and prepare an execution plan for a query, plpy.execute and plpy.prepare. Preparing the execution plan for a query is useful if you run the query from multiple Python function.
- nrows that returns the number of rows returned by the query.
- status which is the SPI_execute() return value.
For example, this Python statement in a PL/Python user-defined function executes a query.
rv = plpy.execute("SELECT * FROM my_table", 5)
The plpy.execute function returns up to 5 rows from my_table. The result set is stored in the rv object. If my_table has a column my_column, it would be accessed as:
my_col_data = rv[i]["my_column"]
Since the function returns a maximum of 5 rows, the index i can be an integer between 0 and 4.
The function plpy.prepare prepares the execution plan for a query. It is called with a query string and a list of parameter types, if you have parameter references in the query. For example, this statement can be in a PL/Python user-defined function:
plan = plpy.prepare("SELECT last_name FROM my_users WHERE first_name = $1", [ "text" ])
The string text is the data type of the variable that is passed for the variable $1. After preparing a statement, you use the function plpy.execute to run it:
rv = plpy.execute(plan, [ "Fred" ], 5)
The third argument is the limit for the number of rows returned and is optional.
When you prepare an execution plan using the PL/Python module the plan is automatically saved. See the Postgres Server Programming Interface (SPI) documentation for information about the execution plans http://www.postgresql.org/docs/8.2/static/spi.html.
To make effective use of saved plans across function calls you use one of the Python persistent storage dictionaries SD or GD.
The global dictionary SD is available to store data between function calls. This variable is private static data. The global dictionary GD is public data, available to all Python functions within a session. Use GD with care.
Each function gets its own execution environment in the Python interpreter, so that global data and function arguments from myfunc are not available to myfunc2. The exception is the data in the GD dictionary, as mentioned previously.
This example uses the SD dictionary:
CREATE FUNCTION usesavedplan() RETURNS trigger AS $$ if SD.has_key("plan"): plan = SD["plan"] else: plan = plpy.prepare("SELECT 1") SD["plan"] = plan # rest of function $$ LANGUAGE plpythonu;
Handling Python Errors and Messages
The message functions plpy.error and plpy.fatal raise a Python exception which, if uncaught, propagates out to the calling query, causing the current transaction or subtransaction to be aborted. The functions raise plpy.ERROR(msg) and raise plpy.FATAL(msg) are equivalent to calling plpy.error and plpy.fatal, respectively. The other message functions only generate messages of different priority levels.
Whether messages of a particular priority are reported to the client, written to the server log, or both is controlled by the Greenplum Database server configuration parameters log_min_messages and client_min_messages. For information about the parameters see the Greenplum Database Reference Guide.
Using the dictionary GD To Improve PL/Python Performance
In terms of performance, importing a Python module is an expensive operation and can affect performance. If you are importing the same module frequently, you can use Python global variables to load the module on the first invocation and not require importing the module on subsequent calls. The following PL/Python function uses the GD persistent storage dictionary to avoid importing a module if it has already been imported and is in the GD.
psql=# CREATE FUNCTION pytest() returns text as $$ if 'mymodule' not in GD: import mymodule GD['mymodule'] = mymodule return GD['mymodule'].sumd([1,2,3]) $$;
Installing Python Modules
When you install a Python module on Greenplum Database, the Greenplum Database Python environment must have the module added to it across all segment hosts and mirror hosts in the cluster. When expanding Greenplum Database, you must add the Python modules to the new segment hosts. You can use the Greenplum Database utilities gpssh and gpscp run commands on Greenplum Database hosts and copy files to the hosts. For information about the utilities, see the Greenplum Database Utility Guide.
As part of the Greenplum Database installation, the gpadmin user environment is configured to use Python that is installed with Greenplum Database.
To check the Python environment, you can use the which command:
The command returns the location of the Python installation. The Python installed with Greenplum Database is in the Greenplum Database ext/python directory.
If you are building a Python module, you must ensure that the build creates the correct executable. For example on a Linux system, the build should create a 64-bit executable.
Before building a Python module to be installed, ensure that the appropriate software to build the module is installed and properly configured. The build environment is required only on the host where you build the module.
Simple Python Module Installation Example (setuptools)
This example manually installs the Python setuptools module from the Python Package Index repository. The module lets you easily download, build, install, upgrade, and uninstall Python packages.
This example first builds the module from a package and installs the module on a single host. Then the module is built and installed on segment hosts.
- Get the module package from the Python Package Index site. For example, run this
wget command on a Greenplum Database host as the
gpadmin user to get the tar
wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-18.4.tar.gz
- Extract the files from the tar file.
tar -xzvf setuptools-18.4.tar.gz
- Go to the directory that contains the package files and run the Python scripts to
build and install the Python
cd setuptools-18.4 python setup.py build && python setup.py install
- The following Python command returns no errors if the module is available to Python.
python -c "import setuptools"
- Copy the package to the Greenplum Database hosts with the gpscp
utility. For example, this command copies the tar file from the current host to the host
systems listed in the file
gpscp -f remote-hosts setuptools-18.4.tar.gz =:/home/gpadmin
- Run the commands to build, install, and test the package with gpssh
utility on the hosts listed in the file remote-hosts. The file
remote-hosts lists all the remote Greenplum Database segment
gpssh -f remote_hosts >>> tar -xzvf setuptools-18.4.tar.gz >>> cd setuptools-18.4 >>> python setup.py build && python setup.py install >>> python -c "import setuptools" >>> exit
You can use the gpssh utility to run the easy-install command on all the Greenplum Database segment hosts.
Complex Python Installation Example (NumPy)
This example builds and installs the Python module NumPy. NumPy is a module for scientific computing with Python. For information about NumPy, see http://www.numpy.org/.
- OpenBLAS libraries, an open source implementation of BLAS (Basic Linear Algebra Subprograms).
- The gcc compilers: gcc, gcc-gfortran, and gcc-c++. The compilers are required to build the OpenBLAS libraries. See OpenBLAS Prerequisites
This example process assumes yum is installed on the Greenplum Database segment hosts and the gpadmin user is a member of sudoers with root privileges on the hosts.
wget --directory-prefix=packages http://github.com/xianyi/OpenBLAS/tarball/v0.2.8 wget --directory-prefix=packages http://sourceforge.net/projects/numpy/files/NumPy/1.8.0/numpy-1.8.0.tar.gz/download
gpssh -f gpdb_remotes mkdir packages gpscp -f gpdb_remotes packages/* =:/home/gpadmin/packages
sudo yum -y install gcc gcc-gfortran gcc-c++
wget http://gfortran.com/download/x86_64/snapshots/gcc-4.4.tar.xz tar xf gcc-4.4.tar.xz -C /usr/local/
export PATH=$PATH:/usr/local/gcc-4.4/bin export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/gcc-4.4/lib
Create a symbolic link to g++ and call it gxx
sudo ln -s /usr/bin/g++ /usr/bin/gxx
You might also need to create symbolic links to any libraries that have different versions available for example libppl_c.so.4 to libppl_c.so.2.
If needed, you can use the gpscp utility to copy files to Greenplum Database hosts and the gpssh utility to run commands on the hosts.
Build and Install OpenBLAS Libraries
Before build and install the NumPy module, you install the OpenBLAS libraries. This section describes how to build and install the libraries on a single host.
- Extract the OpenBLAS files from the file. These commands extract the files from the
OpenBLAS tar file and simplify the directory name that contains the OpenBLAS
tar -xzf packages/v0.2.8 -C /home/gpadmin/packages mv /home/gpadmin/packages/xianyi-OpenBLAS-9c51cdf /home/gpadmin/packages/OpenBLAS
- Compile OpenBLAS. These commands set the LIBRARY_PATH environment
variable and run the make command to build OpenBLAS libraries.
cd /home/gpadmin/packages/OpenBLAS export LIBRARY_PATH=$LD_LIBRARY_PATH make FC=gfortran USE_THREAD=0
- These commands install the OpenBLAS libraries in /usr/local as root
and change the owner of the files to
cd /home/gpadmin/packages/OpenBLAS/ sudo make PREFIX=/usr/local install sudo ldconfig sudo chown -R gpadmin /usr/local/lib
libopenblas.a -> libopenblas_sandybridge-r0.2.8.a libopenblas_sandybridge-r0.2.8.a libopenblas_sandybridge-r0.2.8.so libopenblas.so -> libopenblas_sandybridge-r0.2.8.so libopenblas.so.0 -> libopenblas_sandybridge-r0.2.8.so
You can use the gpssh utility to build and install the OpenBLAS libraries on multiple hosts.
All the Greenplum Database hosts (master and segment hosts) have identical configurations. You can copy the OpenBLAS libraries from the system where they were built instead of building the OpenBlas libraries on all the hosts. For example, these gpssh and gpscp commands copy and install the OpenBLAS libraries on the hosts listed in the gpdb_remotes file.
gpssh -f gpdb_remotes -e 'sudo yum -y install gcc gcc-gfortran gcc-c++' gpssh -f gpdb_remotes -e 'ln -s /usr/bin/g++ /usr/bin/gxx' gpssh -f gpdb_remotes -e sudo chown gpadmin /usr/local/lib gpscp -f gpdb_remotes /usr/local/lib/libopen*sandy* =:/usr/local/lib gpssh -f gpdb_remotes >>> cd /usr/local/lib >>> ln -s libopenblas_sandybridge-r0.2.8.a libopenblas.a >>> ln -s libopenblas_sandybridge-r0.2.8.so libopenblas.so >>> ln -s libopenblas_sandybridge-r0.2.8.so libopenblas.so.0 >>> sudo ldconfig
Build and Install NumPy
After you have installed the OpenBLAS libraries, you can build and install NumPy module. These steps install the NumPy module on a single host. You can use the gpssh utility to build and install the NumPy module on multiple hosts.
- Go to the packages subdirectory and get the NumPy module source and
cd /home/gpadmin/packages tar -xzf numpy-1.8.0.tar.gz
- Set up the environment for building and installing NumPy.
export BLAS=/usr/local/lib/libopenblas.a export LAPACK=/usr/local/lib/libopenblas.a export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ export LIBRARY_PATH=$LD_LIBRARY_PATH
- Go to the NumPy directory and build and install NumPy. Building the NumPy package
might take some
cd numpy-1.8.0 python setup.py build python setup.py installNote: If the NumPy module did not successfully build, the NumPy build process might need a site.cfg that specifies the location of the OpenBLAS libraries. Create the file site.cfg in the NumPy package directory:
cd ~/packages/numpy-1.8.0 touch site.cfg
Add the following to the site.cfg file and run the NumPy build command again:
[default] library_dirs = /usr/local/lib [atlas] atlas_libs = openblas library_dirs = /usr/local/lib [lapack] lapack_libs = openblas library_dirs = /usr/local/lib # added for scikit-learn [openblas] libraries = openblas library_dirs = /usr/local/lib include_dirs = /usr/local/include
- The following Python command ensures that the module is available for import by Python
on a host system.
python -c "import numpy"
As in the simple module installation, you can use the gpssh utility to build, install, and test the module on Greenplum Database segment hosts.
echo -e '\n#Needed for NumPy' >> ~/.bashrc echo -e 'export BLAS=/usr/local/lib/libopenblas.a' >> ~/.bashrc echo -e 'export LAPACK=/usr/local/lib/libopenblas.a' >> ~/.bashrc echo -e 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib' >> ~/.bashrc echo -e 'export LIBRARY_PATH=$LD_LIBRARY_PATH' >> ~/.bashrc
Testing Installed Python Modules
You can create a simple PL/Python user-defined function (UDF) to validate that Python a module is available in the Greenplum Database. This example tests the NumPy module.
This PL/Python UDF imports the NumPy module. The function returns SUCCESS if the module is imported, and FAILURE if an import error occurs.
CREATE OR REPLACE FUNCTION plpy_test(x int) returns text as $$ try: from numpy import * return 'SUCCESS' except ImportError, e: return 'FAILURE' $$ language plpythonu;
Create a table that contains data on each Greenplum Database segment instance. Depending on the size of your Greenplum Database installation, you might need to generate more data to ensure data is distributed to all segment instances.
CREATE TABLE DIST AS (SELECT x FROM generate_series(1,50) x ) DISTRIBUTED RANDOMLY ;
This SELECT command runs the UDF on the segment hosts where data is stored in the primary segment instances.
SELECT gp_segment_id, plpy_test(x) AS status FROM dist GROUP BY gp_segment_id, status ORDER BY gp_segment_id, status;
The SELECT command returns SUCCESS if the UDF imported the Python module on the Greenplum Database segment instance. If the SELECT command returns FAILURE, you can find the segment host of the segment instance host. The Greenplum Database system table gp_segment_configuration contains information about mirroring and segment configuration. This command returns the host name for a segment ID.
SELECT hostname, content AS seg_ID FROM gp_segment_configuration WHERE content = seg_id ;
If FAILURE is returned, these are some possible causes:
- A problem accessing required libraries. For the NumPy example, a Greenplum Database
might have a problem accessing the OpenBLAS libraries or the Python libraries on a
segment host.Make sure you get no errors when running command on the segment host as the gpadmin user. This gpssh command tests importing the numpy module on the segment host mdw1.
gpssh -h mdw1 python -c "import numpy"
- If the Python import command does not return an error, environment
variables might not be configured in the Greenplum Database environment. For example,
the variables are not in the .bashrc file, or Greenplum Database might
not have been restarted after adding the environment variables to the
Ensure sure that the environment variables are properly set and then restart the Greenplum Database. For the NumPy example, ensure the environment variables listed at the end of the section Build and Install NumPy are defined in the .bashrc file for the gpadmin user on the master and segment hosts.Note: On the Greenplum Database master and segment hosts, the .bashrc file for the gpadmin user must source the file $GPHOME/greenplum_path.sh.
This PL/Python UDF returns the maximum of two integers:
CREATE FUNCTION pymax (a integer, b integer) RETURNS integer AS $$ if (a is None) or (b is None): return None if a > b: return a return b $$ LANGUAGE plpythonu;
You can use the STRICT property to perform the null handling instead of using the two conditional statements.
CREATE FUNCTION pymax (a integer, b integer) RETURNS integer AS $$ return max(a,b) $$ LANGUAGE plpythonu STRICT ;
You can run the user-defined function pymax with SELECT command. This example runs the UDF and shows the output.
SELECT ( pymax(123, 43)); column1 --------- 123 (1 row)
This example that returns data from an SQL query that is run against a table. These two commands create a simple table and add data to the table.
CREATE TABLE sales (id int, year int, qtr int, day int, region text) DISTRIBUTED BY (id) ; INSERT INTO sales VALUES (1, 2014, 1,1, 'usa'), (2, 2002, 2,2, 'europe'), (3, 2014, 3,3, 'asia'), (4, 2014, 4,4, 'usa'), (5, 2014, 1,5, 'europe'), (6, 2014, 2,6, 'asia'), (7, 2002, 3,7, 'usa') ;
This PL/Python UDF executes a SELECT command that returns 5 rows from the table. The Python function returns the REGION value from the row specified by the input value. In the Python function, the row numbering starts from 0. Valid input for the function is an integer between 0 and 4.
CREATE OR REPLACE FUNCTION mypytest(a integer) RETURNS text AS $$ rv = plpy.execute("SELECT * FROM sales ORDER BY id", 5) region = rv[a]["region"] return region $$ language plpythonu;
Running this SELECT statement returns the REGION column value from the third row of the result set.
SELECT mypytest(2) ;
This command deletes the UDF from the database.
DROP FUNCTION mypytest(integer) ;
For information about PL/Python see the PostgreSQL documentation at http://www.postgresql.org/docs/8.2/static/plpython.html.
For information about Python Package Index (PyPI), see https://pypi.python.org/pypi.
- SciPy library provides user-friendly and efficient numerical routines such as
routines for numerical integration and optimization http://www.scipy.org/scipylib/index.html. This wget command
downloads the SciPy package tar
- Natural Language Toolkit (nltk) is a platform for building Python programs to work
with human language data. http://www.nltk.org/. This wget command
downloads the nltk package tar
wget http://pypi.python.org/packages/source/n/nltk/nltk-2.0.2.tar.gz#md5=6e714ff74c3398e88be084748df4e657Note: The Python package Distribute https://pypi.python.org/pypi/distribute is required for nltk. The Distribute module should be installed the ntlk package. This wget command downloads the Distribute package tar file.
For information about the Python language, see http://www.python.org/.
A set of slides that were used in a talk about how the Pivotal data science team uses the PyData stack in the Pivotal MPP databases and on Pivotal Cloud Foundry http://www.slideshare.net/SrivatsanRamanujam/all-thingspythonpivotal.