Using Functions and Operators

Greenplum Database evaluates functions and operators used in SQL expressions. Some functions and operators are allowed to execute only on the master because they could lead to inconsistencies in segment databases.

Using Functions in Greenplum Database

Table 1. Functions in Greenplum Database

IMMUTABLE
  Greenplum support: Yes.
  Description: Relies only on information directly in its argument list. Given the same argument values, always returns the same result.

STABLE
  Greenplum support: Yes, in most cases.
  Description: Within a single table scan, returns the same result for the same argument values, but results change across SQL statements.
  Comments: Results depend on database lookups or parameter values. The current_timestamp family of functions is STABLE; their values do not change within an execution.

VOLATILE
  Greenplum support: Restricted.
  Description: Function values can change within a single table scan. For example: random(), currval(), timeofday().
  Comments: Any function with side effects is volatile, even if its result is predictable. For example: setval().

In Greenplum Database, data is divided up across segments — each segment is a distinct PostgreSQL database. To prevent inconsistent or unexpected results, do not execute functions classified as VOLATILE at the segment level if they contain SQL commands or modify the database in any way. For example, functions such as setval() are not allowed to execute on distributed data in Greenplum Database because they can cause inconsistent data between segment instances.

To ensure data consistency, you can safely use VOLATILE and STABLE functions in statements that are evaluated on and run from the master. For example, the following statements run on the master (statements without a FROM clause):

SELECT setval('myseq', 201);
SELECT foo();

If a statement has a FROM clause containing a distributed table and the function in the FROM clause returns a set of rows, the statement can run on the segments:

SELECT * from foo();

Greenplum Database does not support functions that return a table reference (rangeFuncs) or functions that use the refCursor datatype.

User-Defined Functions

Greenplum Database supports user-defined functions. See Extending SQL in the PostgreSQL documentation for more information.

Use the CREATE FUNCTION statement to register user-defined functions that are used as described in Using Functions in Greenplum Database. By default, user-defined functions are declared as VOLATILE, so if your user-defined function is IMMUTABLE or STABLE, you must specify the correct volatility level when you register your function.
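For example, a function that depends only on its argument values could be registered as IMMUTABLE. The function name and body below are illustrative only, not part of Greenplum Database:

```sql
-- Illustrative user-defined function; it depends only on its argument,
-- so it can safely be declared IMMUTABLE instead of the default VOLATILE.
CREATE FUNCTION to_fahrenheit(float8) RETURNS float8 AS $$
  SELECT $1 * 9.0 / 5.0 + 32.0;
$$ LANGUAGE SQL IMMUTABLE;
```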

When you create user-defined functions, avoid using fatal errors or destructive calls. Greenplum Database may respond to such errors with a sudden shutdown or restart.

In Greenplum Database, the shared library files for user-created functions must reside in the same library path location on every host in the Greenplum Database array (masters, segments, and mirrors).

Built-in Functions and Operators

The following table lists the categories of built-in functions and operators supported by PostgreSQL. All functions and operators are supported in Greenplum Database as in PostgreSQL with the exception of STABLE and VOLATILE functions, which are subject to the restrictions noted in Using Functions in Greenplum Database. See the Functions and Operators section of the PostgreSQL documentation for more information about these built-in functions and operators.
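To see how a particular built-in function is classified, you can query the pg_proc system catalog; its provolatile column records the volatility level:

```sql
-- provolatile is 'i' for IMMUTABLE, 's' for STABLE, 'v' for VOLATILE
SELECT proname, provolatile
FROM pg_proc
WHERE proname IN ('random', 'timeofday', 'now');
```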

Table 2. Built-in functions and operators

The following categories contain no VOLATILE or STABLE functions and carry no restrictions: Logical Operators, Comparison Operators, Binary String Functions and Operators, Bit String Functions and Operators, Pattern Matching, Geometric Functions and Operators, Network Address Functions and Operators, Conditional Expressions, Aggregate Functions, Subquery Expressions, and Row and Array Comparisons.

Mathematical Functions and Operators
  VOLATILE: random, setseed

String Functions and Operators
  VOLATILE: all built-in conversion functions
  STABLE: convert, pg_client_encoding

Data Type Formatting Functions
  STABLE: to_char, to_timestamp

Date/Time Functions and Operators
  VOLATILE: timeofday
  STABLE: age, current_date, current_time, current_timestamp, localtime, localtimestamp, now

Sequence Manipulation Functions
  VOLATILE: currval, lastval, nextval, setval

Array Functions and Operators
  STABLE: all array functions

Set Returning Functions
  VOLATILE: generate_series

System Information Functions
  STABLE: all session information functions, all access privilege inquiry functions, all schema visibility inquiry functions, all system catalog information functions, all comment information functions

System Administration Functions
  VOLATILE: set_config, pg_cancel_backend, pg_reload_conf, pg_rotate_logfile, pg_start_backup, pg_stop_backup, pg_size_pretty, pg_ls_dir, pg_read_file, pg_stat_file
  STABLE: current_setting, all database object size functions
  Note: The function pg_column_size displays the bytes required to store a value, possibly with TOAST compression.

XML Functions
  STABLE: xmlagg(xml), xmlexists(text, xml), xml_is_well_formed(text), xml_is_well_formed_document(text), xml_is_well_formed_content(text), xpath(text, xml), xpath(text, xml, text[]), xpath_exists(text, xml), xpath_exists(text, xml, text[]), xml(text), text(xml), xmlcomment(xml), xmlconcat2(xml, xml)

Window Functions

The following built-in window functions are Greenplum extensions to the PostgreSQL database. All window functions are immutable. For more information about window functions, see Window Expressions.

Table 3. Window functions
Function Return Type Full Syntax Description
cume_dist() double precision CUME_DIST() OVER ( [PARTITION BY expr ] ORDER BY expr ) Calculates the cumulative distribution of a value in a group of values. Rows with equal values always evaluate to the same cumulative distribution value.
dense_rank() bigint DENSE_RANK () OVER ( [PARTITION BY expr ] ORDER BY expr ) Computes the rank of a row in an ordered group of rows without skipping rank values. Rows with equal values are given the same rank value.
first_value(expr) same as input expr type FIRST_VALUE( expr ) OVER ( [PARTITION BY expr ] ORDER BY expr [ROWS|RANGE frame_expr ] ) Returns the first value in an ordered set of values.
lag(expr [,offset] [,default]) same as input expr type LAG( expr [, offset ] [, default ]) OVER ( [PARTITION BY expr ] ORDER BY expr ) Provides access to more than one row of the same table without doing a self join. Given a series of rows returned from a query and a position of the cursor, LAG provides access to a row at a given physical offset prior to that position. The default offset is 1. default sets the value that is returned if the offset goes beyond the scope of the window. If default is not specified, the default value is null.
last_value(expr) same as input expr type LAST_VALUE(expr) OVER ( [PARTITION BY expr] ORDER BY expr [ROWS|RANGE frame_expr] ) Returns the last value in an ordered set of values.
lead(expr [,offset] [,default]) same as input expr type LEAD(expr [,offset] [,default]) OVER ( [PARTITION BY expr] ORDER BY expr ) Provides access to more than one row of the same table without doing a self join. Given a series of rows returned from a query and a position of the cursor, lead provides access to a row at a given physical offset after that position. If offset is not specified, the default offset is 1. default sets the value that is returned if the offset goes beyond the scope of the window. If default is not specified, the default value is null.
ntile(expr) bigint NTILE(expr) OVER ( [PARTITION BY expr] ORDER BY expr ) Divides an ordered data set into a number of buckets (as defined by expr) and assigns a bucket number to each row.
percent_rank() double precision PERCENT_RANK () OVER ( [PARTITION BY expr] ORDER BY expr ) Calculates the rank of a hypothetical row R minus 1, divided by 1 less than the number of rows being evaluated (within a window partition).
rank() bigint RANK () OVER ( [PARTITION BY expr] ORDER BY expr ) Calculates the rank of a row in an ordered group of values. Rows with equal values for the ranking criteria receive the same rank. The number of tied rows is added to the rank number to calculate the next rank value, so ranks may not be consecutive numbers in this case.
row_number() bigint ROW_NUMBER () OVER ( [PARTITION BY expr] ORDER BY expr ) Assigns a unique number to each row to which it is applied (either each row in a window partition or each row of the query).
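As an illustration of how the ranking functions differ when values tie, the following query runs rank, dense_rank, and row_number side by side; the employees table is assumed for the example:

```sql
SELECT department_id, salary,
       rank()       OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk,
       dense_rank() OVER (PARTITION BY department_id ORDER BY salary DESC) AS dense_rnk,
       row_number() OVER (PARTITION BY department_id ORDER BY salary DESC) AS row_num
FROM employees;
```

For two employees tied at the top salary in a department, rank and dense_rank both return 1 for each, but rank skips to 3 for the next distinct salary while dense_rank continues with 2; row_number numbers the tied rows arbitrarily.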

Advanced Analytic Functions

The following built-in advanced analytic functions are Greenplum extensions to the PostgreSQL database. Analytic functions are immutable.

Table 4. Advanced Analytic Functions
Function Return Type Full Syntax Description
matrix_add(array[], array[]) smallint[], int[], bigint[], float[] matrix_add( array[[1,1],[2,2]], array[[3,4],[5,6]]) Adds two two-dimensional matrices. The matrices must be conformable.
matrix_multiply( array[], array[]) smallint[], int[], bigint[], float[] matrix_multiply( array[[2,0,0],[0,2,0],[0,0,2]], array[[3,0,3],[0,3,0],[0,0,3]] ) Multiplies two two-dimensional matrices. The matrices must be conformable.
matrix_multiply( array[], expr) int[], float[] matrix_multiply( array[[1,1,1], [2,2,2], [3,3,3]], 2) Multiplies a two-dimensional array and a scalar numeric value.
matrix_transpose( array[]) Same as input array type. matrix_transpose( array [[1,1,1],[2,2,2]]) Transposes a two-dimensional array.
pinv(array []) smallint[], int[], bigint[], float[] pinv(array[[2.5,0,0],[0,1,0],[0,0,.5]]) Calculates the Moore-Penrose pseudoinverse of a matrix.
unnest (array[]) set of anyelement unnest( array['one', 'row', 'per', 'item']) Transforms a one-dimensional array into rows. Returns a set of anyelement, a polymorphic pseudotype in PostgreSQL.
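For example, applying matrix_add and matrix_multiply to small literal arrays produces element-wise sums and scalar products:

```sql
SELECT matrix_add(array[[1,1],[2,2]], array[[3,4],[5,6]]);
-- {{4,5},{7,8}}

SELECT matrix_multiply(array[[1,1,1],[2,2,2],[3,3,3]], 2);
-- {{2,2,2},{4,4,4},{6,6,6}}
```
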
Table 5. Advanced Aggregate Functions
Function Return Type Full Syntax Description
MEDIAN (expr) timestamp, timestamptz, interval, float MEDIAN (expression)

Example:

SELECT department_id, MEDIAN(salary) 
FROM employees 
GROUP BY department_id; 
Can take a two-dimensional array as input. Treats such arrays as matrices.
PERCENTILE_CONT (expr) WITHIN GROUP (ORDER BY expr [DESC/ASC]) timestamp, timestamptz, interval, float PERCENTILE_CONT(percentage) WITHIN GROUP (ORDER BY expression)

Example:

SELECT department_id,
PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY salary DESC)
"Median_cont"
FROM employees GROUP BY department_id;
Performs an inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification and returns the same datatype as the numeric datatype of the argument. The returned value is computed by linear interpolation. Nulls are ignored in this calculation.
PERCENTILE_DISC (expr) WITHIN GROUP (ORDER BY expr [DESC/ASC]) timestamp, timestamptz, interval, float PERCENTILE_DISC(percentage) WITHIN GROUP (ORDER BY expression)

Example:

SELECT department_id, 
PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY salary DESC)
"Median_desc"
FROM employees GROUP BY department_id;
Performs an inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification. The returned value is an element from the set. Nulls are ignored in this calculation.
sum(array[]) smallint[], int[], bigint[], float[] sum(array[[1,2],[3,4]])

Example:

CREATE TABLE mymatrix (myvalue int[]);
INSERT INTO mymatrix VALUES (array[[1,2],[3,4]]);
INSERT INTO mymatrix VALUES (array[[0,1],[1,0]]);
SELECT sum(myvalue) FROM mymatrix;
 sum 
---------------
 {{1,3},{4,4}}
Performs matrix summation. Can take as input a two-dimensional array that is treated as a matrix.
pivot_sum (label[], label, expr) int[], bigint[], float[] pivot_sum( array['A1','A2'], attr, value) A pivot aggregation using sum to resolve duplicate entries.
mregr_coef(expr, array[]) float[] mregr_coef(y, array[1, x1, x2]) The four mregr_* aggregates perform linear regressions using the ordinary-least-squares method. mregr_coef calculates the regression coefficients. The size of the return array for mregr_coef is the same as the size of the input array of independent variables, since the return array contains the coefficient for each independent variable.
mregr_r2 (expr, array[]) float mregr_r2(y, array[1, x1, x2]) The four mregr_* aggregates perform linear regressions using the ordinary-least-squares method. mregr_r2 calculates the r-squared error value for the regression.
mregr_pvalues(expr, array[]) float[] mregr_pvalues(y, array[1, x1, x2]) The four mregr_* aggregates perform linear regressions using the ordinary-least-squares method. mregr_pvalues calculates the p-values for the regression.
mregr_tstats(expr, array[]) float[] mregr_tstats(y, array[1, x1, x2]) The four mregr_* aggregates perform linear regressions using the ordinary-least-squares method. mregr_tstats calculates the t-statistics for the regression.
nb_classify(text[], bigint, bigint[], bigint[]) text nb_classify(classes, attr_count, class_count, class_total) Classify rows using a Naive Bayes Classifier. This aggregate uses a baseline of training data to predict the classification of new rows and returns the class with the largest likelihood of appearing in the new rows.
nb_probabilities(text[], bigint, bigint[], bigint[]) text nb_probabilities(classes, attr_count, class_count, class_total) Determine probability for each class using a Naive Bayes Classifier. This aggregate uses a baseline of training data to predict the classification of new rows and returns the probabilities that each class will appear in new rows.

Advanced Analytic Function Examples

These examples illustrate selected advanced analytic functions in queries on simplified example data. They are for the multiple linear regression aggregate functions and for Naive Bayes Classification with nb_classify.

Linear Regression Aggregates Example

The following example uses the four linear regression aggregates mregr_coef, mregr_r2, mregr_pvalues, and mregr_tstats in a query on the example table regr_example. In this example query, all the aggregates take the dependent variable as the first parameter and an array of independent variables as the second parameter.

SELECT mregr_coef(y, array[1, x1, x2]), 
       mregr_r2(y, array[1, x1, x2]),
       mregr_pvalues(y, array[1, x1, x2]),
       mregr_tstats(y, array[1, x1, x2])
from regr_example;

Table regr_example:

 id | y  | x1 | x2
----+----+----+----
  1 |  5 |  2 |  1
  2 | 10 |  4 |  2
  3 |  6 |  3 |  1
  4 |  8 |  3 |  1

Running the example query against this table yields one row of data with the following values:

mregr_coef:

{-7.105427357601e-15,2.00000000000003,0.999999999999943}

mregr_r2:

0.86440677966103

mregr_pvalues:

{0.999999999999999,0.454371051656992,0.783653104061216}

mregr_tstats:

{-2.24693341988919e-15,1.15470053837932,0.35355339059327}

Greenplum Database returns NaN (not a number) if the results of any of these aggregates are undefined. This can happen if there is a very small amount of data.

Note:

The intercept is computed by setting one of the independent variables to 1, as shown in the preceding example.
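With that column of 1s included in the array of independent variables, the coefficients returned by mregr_coef correspond to the standard ordinary-least-squares solution:

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y
```

where X is the matrix of independent variables (including the column of 1s) and y is the vector of dependent-variable values.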

Naive Bayes Classification Examples

The aggregates nb_classify and nb_probabilities are used within a larger four-step classification process that involves the creation of tables and views for training data. The following two examples show all the steps. The first example shows a small data set with arbitrary values, and the second example is the Greenplum implementation of a popular Naive Bayes example based on weather conditions.

Overview

The following describes the Naive Bayes classification procedure. In the examples, the value names become the values of the field attr:

  1. Unpivot the data.

    If the data is in wide, pivoted form, create a view with the identification and classification fields that unpivots all the attribute values into one attribute-value pair per row. If the data is already unpivoted, you do not need this step.

  2. Create a training table.

    The training table shifts the view of the data to the values of the field attr.

  3. Create a summary view of the training data.
  4. Aggregate the data with nb_classify, nb_probabilities, or both.

Naive Bayes Example 1 – Small Table

This example begins with the data in the example table class_example and proceeds through the four steps:

Table class_example:

 id | class | a1 | a2 | a3 
----+-------+----+----+----
  1 | C1    |  1 |  2 |  3
  2 | C1    |  1 |  4 |  3
  3 | C2    |  0 |  2 |  2
  4 | C1    |  1 |  2 |  1
  5 | C2    |  1 |  2 |  2
  6 | C2    |  0 |  1 |  3
  1. Unpivot the data.

    For use as training data, the data in class_example must be unpivoted, because each row stores several attribute values in separate columns. The terms in single quotation marks define the values to use for the new field attr. By convention, these values are the same as the field names in the original table. In this example, these values are capitalized to highlight where they are created in the command.

    CREATE view class_example_unpivot AS
    SELECT id, class, unnest(array['A1', 'A2', 'A3']) as attr, 
    unnest(array[a1,a2,a3]) as value FROM class_example; 
    

    The unpivoted view presents the training data as one attribute-value pair per row. To inspect it, run the command SELECT * from class_example_unpivot:

     id | class | attr | value
    ----+-------+------+-------
      2 | C1    | A1   |     1
      2 | C1    | A2   |     2
      2 | C1    | A3   |     1
      4 | C2    | A1   |     1
      4 | C2    | A2   |     2
      4 | C2    | A3   |     2
      6 | C2    | A1   |     0
      6 | C2    | A2   |     1
      6 | C2    | A3   |     3
      1 | C1    | A1   |     1
      1 | C1    | A2   |     2
      1 | C1    | A3   |     3
      3 | C1    | A1   |     1
      3 | C1    | A2   |     4
      3 | C1    | A3   |     3
      5 | C2    | A1   |     0
      5 | C2    | A2   |     2
      5 | C2    | A3   |     2
    (18 rows)
    
  2. Create a training table from the unpivoted data.

    The terms in single quotation marks define the values to sum. The terms in the array passed into pivot_sum must match the number and names of classifications in the original data. In the example, C1 and C2:

    CREATE table class_example_nb_training AS
    SELECT attr, value, pivot_sum(array['C1', 'C2'], class, 1) 
    as class_count
    FROM   class_example_unpivot
    GROUP BY attr, value
    DISTRIBUTED by (attr); 
    

    The following is the resulting training table:

     attr | value | class_count
    ------+-------+-------------
     A3   |     1 | {1,0}
     A3   |     3 | {2,1}
     A1   |     1 | {3,1}
     A1   |     0 | {0,2}
     A3   |     2 | {0,2}
     A2   |     2 | {2,2}
     A2   |     4 | {1,0}
     A2   |     1 | {0,1}
    (8 rows)
    
  3. Create a summary view of the training data.
    CREATE VIEW class_example_nb_classify_functions AS
    SELECT attr, value, class_count, array['C1', 'C2'] as classes,
    sum(class_count) over (wa)::integer[] as class_total,
    count(distinct value) over (wa) as attr_count
    FROM class_example_nb_training
    WINDOW wa as (partition by attr);

    The following is the resulting training table:

    attr| value | class_count| classes | class_total |attr_count
    -----+-------+------------+---------+-------------+---------
     A2  |     2 | {2,2}      | {C1,C2} | {3,3}       |         3
     A2  |     4 | {1,0}      | {C1,C2} | {3,3}       |         3
     A2  |     1 | {0,1}      | {C1,C2} | {3,3}       |         3
     A1  |     0 | {0,2}      | {C1,C2} | {3,3}       |         2
     A1  |     1 | {3,1}      | {C1,C2} | {3,3}       |         2
     A3  |     2 | {0,2}      | {C1,C2} | {3,3}       |         3
     A3  |     3 | {2,1}      | {C1,C2} | {3,3}       |         3
     A3  |     1 | {1,0}      | {C1,C2} | {3,3}       |         3
    (8 rows)
    
  4. Classify rows with nb_classify and display the probability with nb_probabilities.

    After you prepare the view, the training data is ready for use as a baseline for determining the class of incoming rows. The following query predicts whether rows are of class C1 or C2 by using the nb_classify aggregate:

    SELECT nb_classify(classes, attr_count, class_count, 
    class_total) as class
    FROM class_example_nb_classify_functions
    where (attr = 'A1' and value = 0) or (attr = 'A2' and value = 
    2) or (attr = 'A3' and value = 1);
    

    Running the example query against this simple table yields a single row that classifies the new observation as C2:

    class 
    -------
    C2
    (1 row)
    

    Display the probabilities for each class with nb_probabilities.

    Once the view is prepared, the system can use the training data as a baseline for determining the class of incoming rows. The following query predicts whether rows are of class C1 or C2 by using the nb_probabilities aggregate:

    SELECT nb_probabilities(classes, attr_count, class_count, 
    class_total) as probability
    FROM class_example_nb_classify_functions
    where (attr = 'A1' and value = 0) or (attr = 'A2' and value = 
    2) or (attr = 'A3' and value = 1);
    

    Running the example query against this simple table yields a single row with the probabilities for each class, the first for C1 and the second for C2:

    probability
    -------------
     {0.4,0.6}
    (1 row)
    

    You can display the classification and the probabilities with the following query.

    SELECT nb_classify(classes, attr_count, class_count, 
    class_total) as class, nb_probabilities(classes, attr_count, 
    class_count, class_total) as probability 
    FROM class_example_nb_classify_functions 
    where (attr = 'A1' and value = 0) 
    or (attr = 'A2' and value = 2) or (attr = 'A3' and value = 
    1); 
    
    

    This query produces the following result:

     class | probability
    -------+-------------
        C2 | {0.4,0.6}
     (1 row)
    

Actual data in production scenarios is more extensive than this example data and yields better results. Accuracy of classification with nb_classify and nb_probabilities improves significantly with larger sets of training data.

Naive Bayes Example 2 – Weather and Outdoor Sports

This example calculates the probabilities of whether the user will play an outdoor sport, such as golf or tennis, based on weather conditions. The table weather_example contains the example values. The identification field for the table is day. There are two classifications held in the field play: Yes or No. There are four weather attributes, outlook, temperature, humidity, and wind. The data is normalized.

 day | play | outlook  | temperature | humidity | wind
-----+------+----------+-------------+----------+--------
 2   | No   | Sunny    | Hot         | High     | Strong
 4   | Yes  | Rain     | Mild        | High     | Weak
 6   | No   | Rain     | Cool        | Normal   | Strong
 8   | No   | Sunny    | Mild        | High     | Weak
10   | Yes  | Rain     | Mild        | Normal   | Weak
12   | Yes  | Overcast | Mild        | High     | Strong
14   | No   | Rain     | Mild        | High     | Strong
 1   | No   | Sunny    | Hot         | High     | Weak
 3   | Yes  | Overcast | Hot         | High     | Weak
 5   | Yes  | Rain     | Cool        | Normal   | Weak
 7   | Yes  | Overcast | Cool        | Normal   | Strong
 9   | Yes  | Sunny    | Cool        | Normal   | Weak
11   | Yes  | Sunny    | Mild        | Normal   | Strong
13   | Yes  | Overcast | Hot         | Normal   | Weak
 (14 rows)

Because this data is normalized, all four Naive Bayes steps are required.

  1. Unpivot the data.
    CREATE view weather_example_unpivot AS SELECT day, play, 
    unnest(array['outlook','temperature', 'humidity','wind']) as 
    attr, unnest(array[outlook,temperature,humidity,wind]) as 
    value FROM weather_example; 
    

    Note the use of quotation marks in the command.

    Running SELECT * from weather_example_unpivot displays the unpivoted data, which contains the following 56 rows.

     day | play | attr        | value
    -----+------+-------------+----------
       2 | No   | outlook     | Sunny
       2 | No   | temperature | Hot
       2 | No   | humidity    | High
       2 | No   | wind        | Strong
       4 | Yes  | outlook     | Rain
       4 | Yes  | temperature | Mild
       4 | Yes  | humidity    | High
       4 | Yes  | wind        | Weak
       6 | No   | outlook     | Rain
       6 | No   | temperature | Cool
       6 | No   | humidity    | Normal
       6 | No   | wind        | Strong
       8 | No   | outlook     | Sunny
       8 | No   | temperature | Mild
       8 | No   | humidity    | High
       8 | No   | wind        | Weak
      10 | Yes  | outlook     | Rain
      10 | Yes  | temperature | Mild
      10 | Yes  | humidity    | Normal
      10 | Yes  | wind        | Weak
      12 | Yes  | outlook     | Overcast
      12 | Yes  | temperature | Mild
      12 | Yes  | humidity    | High
      12 | Yes  | wind        | Strong
      14 | No   | outlook     | Rain
      14 | No   | temperature | Mild
      14 | No   | humidity    | High
      14 | No   | wind        | Strong
       1 | No   | outlook     | Sunny
       1 | No   | temperature | Hot
       1 | No   | humidity    | High
       1 | No   | wind        | Weak
       3 | Yes  | outlook     | Overcast
       3 | Yes  | temperature | Hot
       3 | Yes  | humidity    | High
       3 | Yes  | wind        | Weak
       5 | Yes  | outlook     | Rain
       5 | Yes  | temperature | Cool
       5 | Yes  | humidity    | Normal
       5 | Yes  | wind        | Weak
       7 | Yes  | outlook     | Overcast
       7 | Yes  | temperature | Cool
       7 | Yes  | humidity    | Normal
       7 | Yes  | wind        | Strong
       9 | Yes  | outlook     | Sunny
       9 | Yes  | temperature | Cool
       9 | Yes  | humidity    | Normal
       9 | Yes  | wind        | Weak
      11 | Yes  | outlook     | Sunny
      11 | Yes  | temperature | Mild
      11 | Yes  | humidity    | Normal
      11 | Yes  | wind        | Strong
      13 | Yes  | outlook     | Overcast
      13 | Yes  | temperature | Hot
      13 | Yes  | humidity    | Normal
      13 | Yes  | wind        | Weak
      (56 rows)
  2. Create a training table.
    CREATE table weather_example_nb_training AS SELECT attr, 
    value, pivot_sum(array['Yes','No'], play, 1) as class_count 
    FROM weather_example_unpivot GROUP BY attr, value 
    DISTRIBUTED by (attr); 
    

    Running SELECT * from weather_example_nb_training displays the training data, which contains the following 10 rows.

     attr        | value    | class_count
    -------------+----------+-------------
    outlook      | Rain     | {3,2}
    humidity     | High     | {3,4}
    outlook      | Overcast | {4,0}
    humidity     | Normal   | {6,1}
    outlook      | Sunny    | {2,3}
    wind         | Strong   | {3,3}
    temperature  | Hot      | {2,2}
    temperature  | Cool     | {3,1}
    temperature  | Mild     | {4,2}
    wind         | Weak     | {6,2}
    (10 rows)
  3. Create a summary view of the training data.
    CREATE VIEW weather_example_nb_classify_functions AS SELECT 
    attr, value, class_count, array['Yes','No'] as 
    classes,sum(class_count) over (wa)::integer[] as 
    class_total,count(distinct value) over (wa) as attr_count 
    FROM weather_example_nb_training WINDOW wa as (partition by attr);
    

    Running SELECT * from weather_example_nb_classify_functions displays the training data summary, which contains the following 10 rows.

    attr        |  value  | class_count| classes | class_total| attr_count
    ------------+-------- +------------+---------+------------+-----------
    temperature | Mild    | {4,2}      | {Yes,No}| {9,5}      |         3
    temperature | Cool    | {3,1}      | {Yes,No}| {9,5}      |         3
    temperature | Hot     | {2,2}      | {Yes,No}| {9,5}      |         3
    wind        | Weak    | {6,2}      | {Yes,No}| {9,5}      |         2
    wind        | Strong  | {3,3}      | {Yes,No}| {9,5}      |         2
    humidity    | High    | {3,4}      | {Yes,No}| {9,5}      |         2
    humidity    | Normal  | {6,1}      | {Yes,No}| {9,5}      |         2
    outlook     | Sunny   | {2,3}      | {Yes,No}| {9,5}      |         3
    outlook     | Overcast| {4,0}      | {Yes,No}| {9,5}      |         3
    outlook     | Rain    | {3,2}      | {Yes,No}| {9,5}      |         3
    (10 rows)
  4. Aggregate the data with nb_classify, nb_probabilities, or both.

    Decide what to classify. For example, to classify a single record with the following values:

    temperature | wind | humidity | outlook
    ------------+------+----------+---------
    Cool        | Weak | High     | Overcast
    

    Use the following command to aggregate the data. The result gives the classification Yes or No and the probability of playing outdoor sports under this particular set of conditions.

    SELECT nb_classify(classes, attr_count, class_count, 
    class_total) as class,
           nb_probabilities(classes, attr_count, class_count, 
    class_total) as probability
    FROM weather_example_nb_classify_functions where
      (attr = 'temperature' and value = 'Cool') or
      (attr = 'wind'        and value = 'Weak') or
      (attr = 'humidity'    and value = 'High') or
      (attr = 'outlook'     and value = 'Overcast');
    

    The result is a single row.

    class  |              probability
    -------+---------------------------------------
     Yes   | {0.858103353920726,0.141896646079274}
    (1 row)
    

    To classify a group of records, load them into a table. In this example, the table t1 contains the following records:

     day | outlook  | temperature | humidity |  wind
    -----+----------+-------------+----------+--------
      15 | Sunny    | Mild        | High     | Strong
      16 | Rain     | Cool        | Normal   | Strong
      17 | Overcast | Hot         | Normal   | Weak
      18 | Rain     | Hot         | High     | Weak
    (4 rows)
    

    The following command aggregates the data against this table. The result gives the classification Yes or No and the probability of playing outdoor sports for each set of conditions in the table t1. Both the nb_classify and nb_probabilities aggregates are used.

    SELECT t1.day, 
           t1.temperature, t1.wind, t1.humidity, t1.outlook,
           nb_classify(classes, attr_count, class_count, 
    class_total) as class,
           nb_probabilities(classes, attr_count, class_count, 
    class_total) as probability
    FROM t1, weather_example_nb_classify_functions
    WHERE
      (attr = 'temperature' and value = t1.temperature) or
      (attr = 'wind'        and value = t1.wind) or
      (attr = 'humidity'    and value = t1.humidity) or
      (attr = 'outlook'     and value = t1.outlook)
    GROUP BY t1.day, t1.temperature, t1.wind, t1.humidity, 
    t1.outlook;
    

    The result is four rows, one for each record in t1.

    day| temp| wind   | humidity | outlook  | class | probability
    ---+-----+--------+----------+----------+-------+--------------
    15 | Mild| Strong | High     | Sunny    | No    | {0.244694132334582,0.755305867665418}
    16 | Cool| Strong | Normal   | Rain     | Yes   | {0.751471997809119,0.248528002190881}
    18 | Hot | Weak   | High     | Rain     | No    | {0.446387538890131,0.553612461109869}
    17 | Hot | Weak   | Normal   | Overcast | Yes   | {0.9297192642788,0.0702807357212004}
    (4 rows)