Detecting a Failed Segment

With segment mirroring enabled, Greenplum Database automatically fails over to a mirror segment instance when a primary segment instance goes down. Provided one segment instance is online per portion of data, users may not realize a segment is down. If a transaction is in progress when a fault occurs, the in-progress transaction rolls back and restarts automatically on the reconfigured set of segments. The gpstate utility can be used to identify failed segments. The utility displays information from the catalog tables including gp_segment_configuration.
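
For example, assuming you are logged in to the master host as the Greenplum administrative user, the following sketch shows one way to list any segment instances that are currently marked down (the database name postgres is used only for illustration):

$ gpstate -e
$ psql -d postgres -c "SELECT content, hostname, port, role, preferred_role, status
      FROM gp_segment_configuration
      WHERE status = 'd';"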

If the entire Greenplum Database system becomes nonoperational due to a segment failure (for example, if mirroring is not enabled or not enough segments are online to access all user data), users will see errors when trying to connect to a database. The errors returned to the client program may indicate the failure. For example:

ERROR: All segment databases are unavailable

How a Segment Failure is Detected and Managed

On the Greenplum Database master host, the Postgres postmaster process forks a fault probe process, ftsprobe. This is also known as the FTS (Fault Tolerance Server) process. The postmaster process restarts the FTS if it fails.
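
As a quick check (a sketch; the exact process listing varies by platform and Greenplum version), you can verify on the master host that the fault probe process is running:

$ ps -ef | grep ftsprobe | grep -v grep     # the second grep filters out the grep command itself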

The FTS runs in a loop with a sleep interval between each cycle. On each loop, the FTS probes each primary segment instance by making a TCP socket connection to the segment instance using the hostname and port registered in the gp_segment_configuration table. If the connection succeeds, the segment performs a few simple checks and reports back to the FTS. The checks include executing a stat system call on critical segment directories and checking for internal faults in the segment instance. If no issues are detected, a positive reply is sent to the FTS and no action is taken for that segment instance.

If the connection cannot be made, or if a reply is not received in the timeout period, then a retry is attempted for the segment instance. If the configured maximum number of probe attempts fail, the FTS probes the segment's mirror to ensure that it is up, and then updates the gp_segment_configuration table, marking the primary segment "down" and setting the mirror to act as the primary. The FTS updates the gp_configuration_history table with the operations performed.
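
For example, the following sketch (run from the master; the column layout can vary by Greenplum version) lists the most recent entries that the FTS has recorded in gp_configuration_history:

$ psql -d postgres -c "SELECT * FROM gp_configuration_history ORDER BY time DESC LIMIT 10;"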

When there is only an active primary segment and the corresponding mirror is down, the primary goes into the not synchronizing state and continues logging database changes, so the mirror can be synchronized without performing a full copy of data from the primary to the mirror.

Running the gpstate utility with the -e option displays any issues with primary or mirror segment instances. Other gpstate options that display information about all primary or mirror segment instances, such as -m (mirror instance information) and -c (primary and mirror configuration information), also report primary and mirror issues.

You can also see the mode, s (synchronizing) or n (not synchronizing), and the status, u (up) or d (down), for each segment instance in the gp_segment_configuration table.
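
For example, a minimal sketch of such a query (the database name is illustrative):

$ psql -d postgres -c "SELECT content, role, preferred_role, mode, status
      FROM gp_segment_configuration
      ORDER BY content, role;"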

The gprecoverseg utility is used to bring up a mirror that is down. By default, gprecoverseg performs an incremental recovery, placing the mirror into synchronizing mode, which starts to replay the recorded changes from the primary onto the mirror. If the incremental recovery cannot be completed, the recovery fails and gprecoverseg should be run again with the -F option, to perform full recovery. This causes the primary to copy all of the data to the mirror.
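
A typical recovery sequence, run on the master host as the Greenplum administrative user, might look like the following sketch:

$ gprecoverseg        # incremental recovery: replay recorded changes onto the mirror
$ gpstate -e          # check progress; segments show mode s (synchronizing) until caught up
$ gprecoverseg -F     # only if incremental recovery fails: full copy from the primary to the mirror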

After a segment instance has been recovered, the gpstate -e command might list primary and mirror segment instances that are switched. This indicates that the system is not balanced (the primary and mirror instances are not in their originally configured roles). If a system is not balanced, there might be skew resulting from the number of active primary segment instances on segment host systems.

The gp_segment_configuration table has columns role and preferred_role. These can have values of either p for primary or m for mirror. The role column shows the segment instance's current role, and the preferred_role column shows its original role.

In a balanced system, role and preferred_role match for all segment instances. When they do not match, the system is not balanced. To rebalance the cluster and bring all the segments into their preferred roles, run the gprecoverseg command with the -r option.
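
For example, a sketch that first checks for segment instances running outside their preferred roles and then rebalances them:

$ psql -d postgres -c "SELECT content, role, preferred_role
      FROM gp_segment_configuration
      WHERE role <> preferred_role;"
$ gprecoverseg -r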

Simple Failover and Recovery Example

Consider a single primary-mirror segment instance pair where the primary segment has failed over to the mirror. The following table shows the segment instance preferred role, role, mode, and status from the gp_segment_configuration table before beginning recovery of the failed primary segment.

You can also run gpstate -e to display any issues with primary or mirror segment instances.

          preferred_role   role          mode                    status
Primary   p (primary)      m (mirror)    n (not synchronizing)   d (down)
Mirror    m (mirror)       p (primary)   n (not synchronizing)   u (up)
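
A query such as the following sketch returns the values shown in the table for a single primary-mirror pair (content ID 0 is only an example):

$ psql -d postgres -c "SELECT preferred_role, role, mode, status
      FROM gp_segment_configuration
      WHERE content = 0
      ORDER BY preferred_role DESC;"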

The segment instance roles are not in their preferred roles, and the primary is down. The mirror is up, the role is now primary, and it is not synchronizing because its mirror, the failed primary, is down. After fixing issues with the segment host and primary segment instance, you use gprecoverseg to prepare failed segment instances for recovery and initiate synchronization between the primary and mirror instances.

Once gprecoverseg has completed, the segments are in the states shown in the following table where the primary-mirror segment pair is up with the primary and mirror roles reversed from their preferred roles.

          preferred_role   role          mode                 status
Primary   p (primary)      m (mirror)    s (synchronizing)    u (up)
Mirror    m (mirror)       p (primary)   s (synchronizing)    u (up)

The gprecoverseg -r command rebalances the system by returning the segment roles to their preferred roles.

          preferred_role   role          mode                status
Primary   p (primary)      p (primary)   s (synchronized)    u (up)
Mirror    m (mirror)       m (mirror)    s (synchronized)    u (up)

Configuring FTS Behavior

There is a set of server configuration parameters that affect FTS behavior:
gp_fts_probe_interval
How often, in seconds, to begin a new FTS loop. For example, if the setting is 60 and the probe loop takes 10 seconds, the FTS process sleeps for 50 seconds. If the setting is 60 and the probe loop takes 75 seconds, the process sleeps for 0 seconds. The default is 60, and the maximum is 3600.
gp_fts_probe_timeout
Probe timeout between master and segment, in seconds. The default is 20, and the maximum is 3600.
gp_fts_probe_retries
The number of attempts to probe a segment. For example, if the setting is 5, there will be 4 retries after the first attempt fails. Default: 5
gp_log_fts
Logging level for FTS. The value may be "off", "terse", "verbose", or "debug". The "verbose" setting can be used in production to provide useful data for troubleshooting. The "debug" setting should not be used in production. Default: "terse"
gp_segment_connect_timeout
The maximum time (in seconds) allowed for a mirror to respond. Default: 600 (10 minutes)
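
These parameters can be viewed and changed with the gpconfig utility. The following sketch shows the current probe timeout and then lowers it (the value 30 is illustrative, not a recommendation); depending on the parameter, reloading the configuration with gpstop -u may be sufficient, or a restart may be required:

$ gpconfig -s gp_fts_probe_timeout
$ gpconfig -c gp_fts_probe_timeout -v 30 --masteronly
$ gpstop -u           # reload configuration files without restarting the cluster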

In addition to the fault checking performed by the FTS, a primary segment that is unable to send data to its mirror can change the status of the mirror to down. The primary queues up the data and, after gp_segment_connect_timeout seconds have passed, indicates a mirror failure, causing the mirror to be marked down and the primary to go into change tracking mode.
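
For example, a sketch that checks the current timeout and lists any mirrors that have been marked down:

$ gpconfig -s gp_segment_connect_timeout
$ psql -d postgres -c "SELECT content, hostname, port, status
      FROM gp_segment_configuration
      WHERE role = 'm' AND status = 'd';"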