Thursday, 13 August 2015

Status Code: 54 Timed out connecting to client

A Status Code 54 occurs when the server cannot complete the connection to the client: the accept system call (or the equivalent Winsock call on Windows) times out after 60 seconds. This typically happens when a master or media server tries to connect to bpcd on the client machine and the client fails to respond before the 60-second timeout (the default timer setting) expires. The processes involved are bpcd, bprd, bpbrm, and vnetd (if vnetd is configured for firewall operation).


Table of Contents

1      General Status 54 troubleshooting
1.1        Verify NetBackup Server processes are running:
1.2        Verify the NetBackup Client Daemon (bpcd) is listening
1.3        Use the telnet command to test the NetBackup daemons
1.4        Check logging on the affected Media Server and Client(s):
1.5        Is a firewall present in the configuration?
1.6        Are there any networking issues?
1.6.1         Name resolution issues:
1.6.2         Network performance issues
1.7        Is there a machine resource issue?
2      Troubleshooting for NetBackup Database Agents
3      Links



1      General Status 54 troubleshooting

The goal here is to isolate the issue to a single machine or pair of machines. Because the failure involves socket-to-socket communication, it could lie between the master server and the client or between a media server and the client. The key to success is to identify which socket is not being established. Once this is done, further troubleshooting can narrow the failure to a specific function, network configuration, performance issue, machine resource, etc.

1.1    Verify NetBackup Server processes are running:

On the Windows NT/2000 or UNIX master server, verify the NetBackup Request Manager (bprd), NetBackup Job Daemon (bpjobd), and NetBackup Database Manager (bpdbm) services are running. These daemons must be running on the master server.

Open a command prompt in Windows, or open a shell in UNIX as root, and run the following command:

For Windows run:
% <install_path>\VERITAS\NetBackup\bin\bpps
* 1TIME                                     4/25/05 14:02:30.203
COMMAND      PID      LOAD     TIME   MEM                  START
bpdbm       1912    0.000%    0.031  4.4M   4/25/05 13:07:45.890
bprd        2080    0.000%    0.125  4.7M   4/25/05 13:07:47.531
bpjobd      3248    0.000%    0.171  3.4M   4/25/05 13:07:48.703
<Media manager processes would be displayed here>

For UNIX run:
# /usr/openv/netbackup/bin/bpps -a
NB Processes
------------
root 18633     1  0  Apr 22 ?  0:01 /usr/openv/netbackup/bin/bpdbm
root 18620     1  0  Apr 22 ?  0:02 /usr/openv/netbackup/bin/bprd
root 18641 18633  0  Apr 22 ?  0:05 /usr/openv/netbackup/bin/bpjobd
<Media manager processes would be displayed here>

If these services are not running on the Windows master, start them.
On the Windows Desktop:
1.     Right-click on My Computer on the desktop, or within the Start Menu and choose "Manage".
2.     Expand Services and Applications and highlight Services.
3.     Locate the NetBackup services (NetBackup Request Manager, NetBackup Database Manager, NetBackup Client Service, NetBackup Volume Manager, NetBackup Device Manager) and verify they are started.
4.     If services are not started then right-click on each service and choose Start.

If these services are not running on the UNIX master, start them.
# /usr/openv/netbackup/bin/goodies/netbackup start

1.2    Verify the NetBackup Client Daemon (bpcd) is listening.

Client daemons such as the NetBackup Client Service (bpcd) are started by bpinetd.exe on Windows or by inetd/xinetd on UNIX/Linux, so they won't appear in the bpps output. Instead, use the netstat command to verify that these daemons are in LISTEN status. On the master and the affected client(s), run the command below to verify that bpcd is listening.

Windows: netstat -a > c:\netstat.txt
UNIX:    netstat -a > /tmp/netstat.txt

The netstat.txt file that gets created should list the listening processes that are running (bpcd, vnetd, vopied, bpjava-msvc).  Search this file to determine if bpcd is in LISTEN status.  The vnetd process should also be in LISTEN status if vnetd is being used for firewalls.

Windows: TCP    hostname:bpcd    hostname.domain.com:0  LISTENING
UNIX:    *.bpcd     *.*     0  0  49152   LISTEN
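
Searching netstat.txt by hand can be tedious on a busy server. The following Python sketch shows one way to automate the search; it is not part of NetBackup. The function name listening_services is hypothetical, and matching the numeric ports (13782 for bpcd, 13724 for vnetd) is an assumption added to cover netstat output that prints port numbers instead of service names.

```python
import re

# Service names from this document; the port numbers are the ones it quotes.
SERVICES = {"bpcd": "13782", "vnetd": "13724"}


def listening_services(netstat_text):
    """Return the set of service names that appear on a LISTEN line."""
    found = set()
    for line in netstat_text.splitlines():
        # Windows prints LISTENING, UNIX prints LISTEN.
        if "LISTEN" not in line.upper():
            continue
        for name, port in SERVICES.items():
            # Match ":bpcd", ".bpcd", ":13782" or ".13782" at the end of an address.
            if re.search(r"[.:](%s|%s)\s" % (re.escape(name), port), line):
                found.add(name)
    return found
```

Read the file created above and pass its contents to the function, for example listening_services(open("/tmp/netstat.txt").read()). If "bpcd" is missing from the result, the client daemon is not listening.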

1.3    Use the telnet command to test the NetBackup daemons

Once the problem systems have been identified, another test is to telnet to the NetBackup well-known ports from machine to machine. For example, a telnet session could be run from the Master server to the Media server or client, and vice versa:

From Master command line:
# telnet <machine name or machine IP address> bpcd
This will connect to the target machine and display a message similar to the ones below:

For UNIX:
If telnet is successful you will get a message similar to:
# telnet nbclient bpcd
Trying x.x.x.x...
Connected to nbclient.domain.com.
Escape character is '^]'.
< If successful no additional messages will be returned >
Press enter to end telnet session.

If telnet is unsuccessful you will get a message similar to:
# telnet nbclient bpcd
Trying x.x.x.x...
telnet: Unable to connect to remote host: Connection refused
The telnet session will end automatically and return to the prompt.

For Windows:
If telnet is successful you will get a message similar to:
% telnet nbclient bpcd
< If successful no displayed messages will be returned >
Press enter to end telnet session.

If telnet is unsuccessful you will get a message similar to:
% telnet nbclient bpcd
Connecting to nbclient. . .Could not open a connection to host on port 13782 : Connect failed
The telnet session will end automatically and return to the prompt.

This is also a very good test for firewall issues to see if a path is open through the firewall.
This test can be repeated for connection testing to bprd, bpdbm, and vnetd.
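
Where a telnet client is not installed, the same reachability test can be approximated with a short Python sketch. The function name can_connect and the 10 second default timeout are assumptions chosen for illustration. Like the telnet test, it only shows whether the TCP port accepts a connection; it does not exchange any NetBackup protocol data.

```python
import socket

BPCD_PORT = 13782  # bpcd port number quoted in this document


def can_connect(host, port=BPCD_PORT, timeout=10.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
```

For example, can_connect("nbclient") tests bpcd on the example client used above, and can_connect("nbclient", 13724) tests vnetd.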

1.4    Check logging on the affected Media Server and Client(s):

Examine the All Log Entries report for the time of the failure to determine where the failure occurred. Also view the logging information detailed in the previous flow chart for error and failure information. This log information is the best way to isolate where the problem is occurring and which machines are involved, so that you can concentrate your troubleshooting efforts.



The Media server bpbrm log and the client bpcd log will contain identical logconnections lines:
<2> logconnections: BPCD ACCEPT FROM x.x.x.x.<port> TO y.y.y.y.13782

The x.x.x.x will be the source IP address for the connection.  Verify this is using the expected network interface.  The client will need to have forward and reverse name lookup information for this IP address.
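
To compare the source address across many log lines, the logconnections entry can be split into its parts. The sketch below assumes the IPv4 "address.port" layout shown in the line above; the function name parse_logconnections is hypothetical.

```python
import re

# Layout assumed from the example line: "FROM <ip>.<port> TO <ip>.<port>".
LOGCONN = re.compile(
    r"logconnections: BPCD ACCEPT FROM "
    r"(?P<src_ip>\d+\.\d+\.\d+\.\d+)\.(?P<src_port>\d+) TO "
    r"(?P<dst_ip>\d+\.\d+\.\d+\.\d+)\.(?P<dst_port>\d+)"
)


def parse_logconnections(line):
    """Return (src_ip, src_port, dst_ip, dst_port), or None if no match."""
    m = LOGCONN.search(line)
    if m is None:
        return None
    return (
        m.group("src_ip"),
        int(m.group("src_port")),
        m.group("dst_ip"),
        int(m.group("dst_port")),
    )
```

The first element of the result is the source IP address that the client must be able to resolve in both directions.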

Example from a UNIX Media server /usr/openv/netbackup/logs/bpbrm/log.<date> file:
<2> bpcr_connect: bpcr_connect timeout during select after 60 seconds on port <port>
<16> bpbrm start_bpcd: timed out trying to connect to <hostname>

This indicates the client did not reply to the server before the 60 second socket timeout.  In this case check the client’s bpcd log for additional troubleshooting information.

Example from a UNIX client /usr/openv/netbackup/logs/bpcd/log.<date> file:
<8> bpcd peer_hostname: gethostbyaddr failed: HOST_NOT_FOUND (1)
<16> bpcd peer_hostname: gethostbyaddr failed to return peer host, herrno = 1
<16> bpcd main: Couldn't get peer hostname

Example from a UNIX client /usr/openv/netbackup/logs/bpcd/log.<date> file:
<2> hosts_equal: gethostbyname failed for <hostname>: No such host is known. (0)

This would indicate a failure of the forward or reverse name lookup of the master or media server. NetBackup does a reverse name lookup of the IP address to get the name, which it then authenticates against the SERVER entries in the Windows Registry or in /usr/openv/netbackup/bp.conf on UNIX.

After reviewing the log files, a better idea of what machines are involved in the failure should be evident.
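
The log messages quoted in this section can also be searched for automatically. The following Python sketch matches only the substrings that appear in the excerpts above; the hint wording and the function name diagnose are assumptions added for illustration, and other NetBackup versions may log different text.

```python
# Substrings taken from the log excerpts above, each paired with a hint.
SIGNATURES = [
    ("bpcr_connect timeout during select",
     "server timed out waiting for the client; check the client bpcd log"),
    ("gethostbyaddr failed",
     "client could not reverse-resolve the server IP address"),
    ("gethostbyname failed",
     "client could not resolve a server hostname"),
]


def diagnose(log_lines):
    """Return the distinct hints whose signature appears, in the order seen."""
    hints = []
    for line in log_lines:
        for needle, hint in SIGNATURES:
            if needle in line and hint not in hints:
                hints.append(hint)
    return hints
```

Pass the lines of a bpbrm or bpcd log file to diagnose() to get a short list of what to check next.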

For name lookup errors, add an entry to /etc/hosts on UNIX or to C:\WINDOWS\system32\drivers\etc\hosts on Windows and try the operation again.
x.x.x.x  master   master.domain.com

1.5    Is a firewall present in the configuration?

If so, are all of the required ports open? Check the NetBackup System Administrator Guide (for UNIX or Windows) for firewall and port information. At a minimum, ports 13782 (bpcd) and 13724 (vnetd) need to be opened in the firewall for a client backup. This requires configuration for the client on the master before it will work. Additional ports are required for restores or if the client is also a media server.

Example from a UNIX client /usr/openv/netbackup/logs/bpcd/log.<date> file:
<2> bpcd peer_hostname: Connection from host <hostname> (x.x.x.x) port <reserved port>
<2> bpcd main: Peer hostname is <hostname>
<2> nb_bind_on_port_addr: bound to port <reserved port>
<2> bpcd main: Got socket for output 5, lport = <reserved port>

This would indicate the client is using the default of reserved ports for the callback. The nb_bind_on_port_addr: call will display the reserved port number being used for the callback. A firewall will most likely block reserved ports, which will cause the backup to abort on the media server with a status 54.



Example from a UNIX client /usr/openv/netbackup/logs/bpcd/log.<date> file:
<4> bpcd valid_server: hostname comparison succeeded
<2> bpcd main: output socket port number = 13782
Note: For NetBackup 5.x there will be a dozen “<2> vnet vnetd_<function>” log entries between these lines.
<2> get_vnetd_socket: connected to vnetd socket 5

This would indicate the client is using vnetd port for callbacks.  The nb_bind_on_port_addr: call will not appear in the logs when vnetd is used for callbacks. 
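
The two cases can be told apart by checking which of the two log entries is present. The sketch below relies only on the message text shown in the excerpts above; the function name callback_method and the returned labels are assumptions for illustration.

```python
import re

BOUND = re.compile(r"nb_bind_on_port_addr: bound to port (\d+)")


def callback_method(bpcd_log_lines):
    """Return (method, port): method is 'vnetd', 'reserved port' or 'unknown'."""
    for line in bpcd_log_lines:
        if "get_vnetd_socket: connected to vnetd socket" in line:
            # vnetd callbacks do not log a bound reserved port.
            return ("vnetd", None)
        m = BOUND.search(line)
        if m:
            return ("reserved port", int(m.group(1)))
    return ("unknown", None)
```

A result of ("reserved port", n) with a firewall between the client and the media server points to the firewall as the likely cause of the status 54.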

1.6    Are there any networking issues?

1.6.1    Name resolution issues:

Use the bpclntcmd command to test name lookups in both directions. Run it against both the hostname and the IP address of each machine involved in order to test both forward and reverse name lookups.
Review the following Technote http://support.veritas.com/docs/261393 for details on using the bpclntcmd command.
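
As a quick cross-check of the operating system resolver, independent of bpclntcmd, a forward and reverse lookup can be sketched in Python. This is not a substitute for bpclntcmd, which tests the lookups as NetBackup performs them. The function name check_name and its replaceable resolver arguments are assumptions made so the logic can be exercised without a live DNS.

```python
import socket


def check_name(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name to an IP and the IP back to a name; None marks a failed step."""
    result = {"name": name, "ip": None, "reverse_name": None}
    try:
        result["ip"] = forward(name)
        # gethostbyaddr returns (hostname, aliases, addresses).
        result["reverse_name"] = reverse(result["ip"])[0]
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        pass
    return result
```

Run check_name() on each machine for the names of the other machines involved. A reverse_name that is None, or that differs from the name in the SERVER entry, matches the lookup failures described in section 1.4.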

1.6.2    Network performance issues

·  Duplex issues: run "netstat -ian" to check for Ierrs or Oerrs.
·  Routing issues: run "netstat -rn", "traceroute" or "ifconfig -a" to check for routing or subnet mask errors.
·  Network bottlenecks: run "ftp" or "ttcp" to test underlying network performance.

1.7    Is there a machine resource issue?

Verify VERITAS suggested minimum kernel parameters are in place for UNIX machines.  Review the following Technote:  http://seer.support.veritas.com/docs/238063.htm

2      Troubleshooting for NetBackup Database Agents

Script-based NetBackup database clients such as DB2, Informix, Oracle, SAP, Sybase, SQL-Server, and Teradata require additional troubleshooting to resolve status 54 errors. These clients use comm files in the /usr/openv/netbackup/logs/user_ops directory tree that must be updated by the master and the media server and then read by the client prior to establishing the Name and Data sockets.

First, three connections to the client occur from the master and then the media server.  These connections use bpcd on the client, including the server connect-back, to update the comm file with job progress information and eventually the hostname and additional port numbers that the client should use to establish the Name and Data sockets.  Troubleshooting a status 54 during this portion of the backup or restore is identical to the steps for a standard backup described in this document.

A second cause for a status 54 on a database client backup or restore occurs when the client fails to receive an expected update from either the master or the media server before the CLIENT_READ_TIMEOUT or other timeout expires on the client. Upon timeout, the database client will exit in error. Eventually the job will become active and bpbrm will bind to ports for the Name and Data sockets, write the port numbers into the comm file, and wait for the database client to connect back. If the connect-back does not occur within 60 seconds of the comm file update, bpbrm will fail the job with a status 54. The bpbrm log on the media server will show the additional ports for the sockets along with the media server hostname that the client should use to complete the connect-back.

<2> bpbrm listen_for_client: HOT_ORACLE_DB_BACKUP
<2> bpbrm listen_for_client: bpbrm.c.19241: listen(2)ing on port: 3826 3826 0x00000ef2
<2> bpbrm listen_for_client: bpbrm.c.19243: listen(2)ing on port: 4941 4941 0x0000134d
<2> bpcr_get_peername_rqst: Server peername length = 8
<2> bpbrm write_msg_to_progress_file: INF - Data socket = sv2n2adm.3826
<2> bpbrm write_msg_to_progress_file: INF - Name socket = sv2n2adm.4941

Please note that the hostname provided in the comm file may differ from the expected hostname for the media server. Such a mismatch is a third potential cause for a status 54 on a database client backup or restore. If the client cannot resolve the provided hostname and complete the socket through the network, then bpbrm will time out after 60 seconds and fail the job with a status 54.
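
To see which hostname and ports the client was told to use, the Data socket and Name socket lines can be extracted from the bpbrm log. The sketch below assumes the "hostname.port" layout shown in the excerpt above; the function name comm_file_sockets is hypothetical.

```python
import re

# Layout assumed from the excerpt: "INF - Data socket = <hostname>.<port>".
SOCKET_LINE = re.compile(
    r"INF - (?P<kind>Data|Name) socket = (?P<host>\S+)\.(?P<port>\d+)\s*$"
)


def comm_file_sockets(log_lines):
    """Return {'Data': (host, port), 'Name': (host, port)} for the lines found."""
    sockets = {}
    for line in log_lines:
        m = SOCKET_LINE.search(line)
        if m:
            sockets[m.group("kind")] = (m.group("host"), int(m.group("port")))
    return sockets
```

The hostname returned here is the one the database client must be able to resolve and reach, so it is worth testing from the client with the name lookup and connection checks described earlier.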

Hence, it is vitally important to check the database client log to determine whether the database client has already exited, is denied a socket by the network, is unable to bind to a local port, or is otherwise unable to read the comm file. To determine the exact cause, enable logging for the database client per the Troubleshooting instructions in the VERITAS NetBackup™ for <Database agent> System Administrator's Guide.
