backup, Backup, BACKUP!!
Here is a list of changes made to our environment in an attempt to improve stability:
Problem: Networker application was locking up.
Cause: Clone job sets were too large.
Solution: All new clone scripts were written to change the methodology behind our clone operations.
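The clone scripts themselves are not reproduced here. As a rough sketch of the general idea only (keeping each clone job small), the Python below groups save sets into size-capped batches. The function name, the (save set ID, size) input format, and the size cap are all hypothetical and are not taken from our actual scripts.

```python
from typing import Iterable, Iterator, List, Tuple


def batch_save_sets(
    save_sets: Iterable[Tuple[str, int]], max_bytes: int
) -> Iterator[List[str]]:
    """Yield lists of save set IDs whose combined size does not exceed max_bytes.

    A single save set larger than max_bytes is placed in a batch of its own.
    """
    batch: List[str] = []
    batch_bytes = 0
    for ssid, size in save_sets:
        # Close the current batch if adding this save set would exceed the cap.
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(ssid)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch would then be handed to a separate clone job, so that no single job grows without bound.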
Problem: Staging processes from the disk target in our backup environment would cause Networker to lock up.
Cause: The staging processes use cloning in the background and were creating clone jobs that were too large.
Solution: Adopted newly created cloning scripts to handle staging from disk devices.
Problem: Networker application was locking up.
Cause: Security scanner was scanning application ports.
Solution: Security scans were disabled for Networker application ports.
Problem: RPC timeouts would cause Networker to lose communication with the daemon processes responsible for tape drive access.
Cause: Possibly the clone set problem, or RPC tuning issues.
Solution: New clone scripts were written and RPC tuning was applied.
Problem: Some data was not being cloned.
Cause: More full backup data was being generated daily than could be cloned in a day.
Solution: The cloning policy was changed so that only one monthly copy is kept offsite.
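The arithmetic behind this problem is simple: if each day produces more full backup data than the clone window can copy, the uncloned backlog grows without limit. The sketch below illustrates that; the function name and the figures in the usage note are made up for illustration and are not measurements from our environment.

```python
def clone_backlog(daily_full_gb: float, daily_clone_capacity_gb: float, days: int) -> float:
    """Return the uncloned backlog in GB after `days`, never dropping below zero."""
    backlog = 0.0
    for _ in range(days):
        # Each day adds new full backup data and removes whatever cloning can copy.
        backlog = max(0.0, backlog + daily_full_gb - daily_clone_capacity_gb)
    return backlog
```

With hypothetical figures of 900 GB generated and 600 GB cloned per day, `clone_backlog(900, 600, 7)` shows the backlog growing by 300 GB every day, whereas any daily volume at or below the clone capacity leaves no backlog at all.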
Problem: A tape drive that was working on the Networker server and the Celerra failed to eject properly when accessed from the Storage Node.
Cause: The kernel versions differed between the Networker server and the Storage Node. The SCSI drivers are included in the kernel.
Solution: We upgraded the kernel on both machines to the latest version.
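A mismatch like this is easy to miss, so it may be worth checking periodically. The sketch below assumes the kernel release string has already been collected from each host (for example from `uname -r`) and simply groups hosts by release. The function name, host names, and release strings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_hosts_by_kernel(releases: Dict[str, str]) -> Dict[str, List[str]]:
    """Group host names by kernel release string, ignoring surrounding whitespace."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for host, release in sorted(releases.items()):
        groups[release.strip()].append(host)
    return dict(groups)
```

If the result contains more than one key, the hosts are running different kernels and therefore possibly different SCSI drivers.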
Problem: Clients were getting an RPC timeout during backup operations.
Cause: Service port would exit before transmission was completed on the data ports.
Solution: Timeout values were adjusted at the group level in the Networker group configurations.
Problem: Clients with multiple IP addresses were failing to back up.
Cause: Clients had been configured with the public name and Networker was confused as to the client’s identity.
Solution: Clients were destroyed and recreated using the client identifier to maintain save sets.
Problem: Admin network backup traffic was traveling over the public network if the storage node was the secondary backup server.
Cause: Networker identified the storage node via its public name.
Solution: A client side host entry was created to point the public name at the private IP address.
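For illustration, a host entry of this kind is a single line in the client's hosts file that maps the storage node's public name to its private address. The helper below builds such a line; the function name, the host name, and the IP address in the usage note are placeholders, not our real values.

```python
import ipaddress


def hosts_entry(private_ip: str, public_name: str, *aliases: str) -> str:
    """Build one hosts-file line mapping public_name (and aliases) to private_ip."""
    # Raises ValueError if private_ip is not a valid IPv4 or IPv6 address.
    ipaddress.ip_address(private_ip)
    return "\t".join([private_ip, public_name, *aliases])
```

For example, `hosts_entry("10.0.0.15", "storagenode.example.com", "storagenode")` produces a tab-separated line that could be appended to `/etc/hosts` on a Linux client, so that lookups of the public name resolve to the private network.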
Problem: NDMP file level restore operations did not function.
Cause: Authorization had not been granted at the Networker client configuration level.
Solution: Authorization was granted.
Problem: pre_clone.sh script was being run every time a save group completed.
Cause: Script was initially used to initiate clone operations. Old notification rule set was causing the script to be executed every time a save group completed.
Solution: Notification rule was removed.
Problem: Several unusable tape drive resources were being shown as available in the Networker application.
Cause: Networker scans the SCSI bus to include all devices it can see. Invalid drives were an artifact created by the udev process on our Linux backup servers.
Solution: Each device was manually configured and the invalid ones were removed.
Problem: Volume identifiers were being corrupted in Networker.
Cause: Scanner on robot arm of the library was not correctly reading labels.
Solution: Scanner was replaced.
Problem: Tape drives would automatically disable (not a SCSI reset).
Cause: Tape drives had physical defects.
Solution: Drives are being replaced as needed.
Problem: Reboot of Networker server causes the jukebox to jump SCSI addresses and results in having to rebuild the jukebox resource.
Cause: Linux only supports persistent binding at the device file level. Networker only recognizes the robot arm at the SCSI address level.
Solution (Mitigated): New procedure to change SCSI control port “on the fly” implemented.
Finally, I also went to a SNIA class and learned quite a bit. The SNIA material is all based on ITIL principles and will be a good source of information. I will be taking the SNIA Certified Professional test in a couple of weeks.