December Activity Report 2008

16 12 2008
  • Operational events.
  • We’re pushing to update all of our Networker clients to the latest software version so that we can manage client updates from the Networker server.
  • I’ve identified a solution to our EDL capacity issues.  We will configure each of our LTO-4 media to hold only 200 GB.  This will allow storage on the Virtual Tape Library to be used for backups in more granular chunks.


November 2008 Activity Report

24 11 2008
  • Kenon and I went to Raleigh, NC to evaluate a possible Celerra research solution.  We were there for two days and evaluated throughput and features of the proposed solution.
  • I updated our storage diagram to include our latest EDL purchase.  We also received our final licensing for the EDL.
  • We recently upgraded Networker to 7.4.3 and have been dealing with a few bugs we discovered.  Owner notifications stopped working, removal of temporary license keys caused the application to crash, and our throughput numbers are no longer shown on the group monitor tabs of our management application.  So far EMC has provided patches for the owner notification issue and the crash issue.  We have implemented the notification patch and haven’t had any notification problems since.
  • We retired the StorageTek L700e this month.
  • Everything else this month has been operational.


October 2008 Activity Report

27 10 2008

Backup environment:

This month we upgraded Networker from 7.3.4 to 7.4.3.  This resolved several bugs and improved performance throughout the environment.  One of the most noticeable changes was an increase in the amount of detail available via logs.  However, the newer version is not perfect.  We have found two bugs so far in the new version.  One of these bugs is that the “Owner Notification” field is not being executed.  EMC has reproduced the bug in their lab and has forwarded it on to their development team to address.

We’ve completed the evaluation of the EMC VTL and we are in the process of completing the purchase of the new hardware.  Once all of the details are finalized and we officially own it, we’ll be able to accept new clients into our backup environment without overburdening the system.

I have created a software repository for client software in our backup environment.  Once all of our clients are upgraded to version 7.3 or greater we’ll be able to maintain client versions from the server from here on out.

Scripts:

I wrote a Perl script this month to keep track of all DNS queries going to our two DNS servers.  The results are sorted in descending order and then sent out via e-mail.  This will allow us to keep track of which lookups are most frequently executed in our environment.
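
The counting and sorting step described above can be sketched roughly as follows. This is an illustration in Python, not the Perl script itself. The log line format (a fragment such as `query: name IN A`), the regular expression, and the function name `top_queries` are all assumptions made for the example, and the e-mail step is left out.

```python
import re
from collections import Counter

# Assumed (hypothetical) log fragment: "... query: www.example.edu IN A ..."
QUERY_RE = re.compile(r"query:\s+(\S+)\s+IN\s+(\S+)")


def top_queries(lines):
    """Return (name, count) pairs, most frequently queried name first."""
    counts = Counter()
    for line in lines:
        match = QUERY_RE.search(line)
        if match:
            # DNS names are case-insensitive, so fold case before counting.
            counts[match.group(1).lower()] += 1
    # Sort by descending count, then by name so ties come out in a fixed order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = [
        "client 10.0.0.5#4021: query: www.example.edu IN A +",
        "client 10.0.0.6#4022: query: mail.example.edu IN MX +",
        "client 10.0.0.7#4023: query: WWW.example.edu IN A +",
    ]
    for name, count in top_queries(sample):
        print(f"{count:6d}  {name}")
```

Feeding the function the combined logs from both servers would give a single ranked list, which could then be placed in the body of the report e-mail.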

I have further refined our cloning scripts to only send cloning data over fiber channel.  This has eliminated the network component of our cloning processes and reduced the window to clone our data by another factor of two.

Operational:

I continue to support our various storage resources.  Backup is especially time-intensive as we manage a lot of data and clients in our environment.  Additionally, there has been a lot of work centered around the BlueArc and research storage in general this month.

Team Lead:

I was appointed team lead of the storage environment this month and this role is new to me.  It has already brought additional meetings as well as the responsibility of an even more active role in how our storage environment is used by the community.  I look forward to the challenge.



September 2008 Activity Report

25 09 2008

This month I have been working on the evaluation of a VTL (virtual tape library) for our backup environment.

We had a couple of initial issues that caused restore problems, but I have since ironed those out and everything appears to be running smoothly.  We are getting very good performance through the VTL we are evaluating right now.  Our only current concern is failover between engines.  This solution runs an active/active configuration and one engine will take over the other engine’s workload in the event that one goes offline.  We had an unplanned production test of this feature a couple of weeks ago that resulted in the hardware locking up and going offline.  It took almost two days to restore service.

Operationally I have been working on fine tuning our backup pools, groups, client priorities, and cloning processes to optimize performance on the VTL.



Activity Report August 2008

25 08 2008
  • We installed 2 new disk array enclosures and 2 new fiber channel switches into the BlueArc environment to expand existing capacity.
  • I optimized the cloning scripts for the i2k tape library data in order to speed up cloning time.  Cloning is now organized such that each thread will completely clone all data available to clone on a single tape before moving on to the next tape.  This should reduce the time that loading and unloading tapes adds to the cloning procedure.
  • We’ve been plagued by a couple of issues with our tape drives on the new i2k.  A firmware upgrade appears to have fixed them.
  • We are continuing to work on the VTL evaluation.  The hardware install has been completed.  We are now coordinating with EMC to get software resources to complete configuration so that we can test the VTL in our environment.
  • I have continued to investigate clients with failing backups within our environment and to determine the causes so that we can have as complete a backup of all of our systems as possible.
  • Operational activities have consumed the remainder of my time.
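
The per-tape cloning order mentioned in the list above can be sketched as follows. This is only an illustration in Python of the idea of handing whole tapes to threads. The input shape (pairs of save set ID and volume label), the round-robin assignment, and the name `plan_clone_work` are assumptions, not details of the actual scripts.

```python
from collections import defaultdict


def plan_clone_work(save_sets, thread_count):
    """Group save sets by source tape, then assign whole tapes to threads.

    save_sets: iterable of (save_set_id, volume_label) pairs (hypothetical shape).
    Returns one list per thread; each entry is (volume_label, [save_set_ids]).
    """
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")

    by_volume = defaultdict(list)
    for save_set_id, volume in save_sets:
        by_volume[volume].append(save_set_id)

    plans = [[] for _ in range(thread_count)]
    # Hand out whole tapes round-robin so no tape is shared between threads
    # and each tape only needs to be loaded once.
    for index, volume in enumerate(sorted(by_volume)):
        plans[index % thread_count].append((volume, by_volume[volume]))
    return plans


if __name__ == "__main__":
    work = [("ss1", "A00001"), ("ss2", "A00002"), ("ss3", "A00001")]
    for number, plan in enumerate(plan_clone_work(work, 2)):
        print(number, plan)
```

Because each thread finishes everything on one tape before taking the next, a tape is mounted once per cloning pass rather than once per save set.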


Activity Report July 2008

25 07 2008

This month saw many changes to our storage environment.  Here is a list:

  • Six 146G Fiber Channel DAEs installed in the Hill Center CX3-80.
  • One 1TB SATA DAE installed in the Hill Center CX3-80.
  • One 1TB SATA DAE installed in the Stevenson Center CX3-40.
  • Cava upgraded on the two virus check servers for the Celerra.
  • Flare 26 patches applied to our Clariion service processors to address a bug.
  • New Quantum I2000 library installed into our backup environment.
  • BlueArc request pages added to the storage site to facilitate administrative requests from other departments.

In addition to these changes I have worked on the following:

  • Update of backup cloning scripts to support the I2k Library.
  • Setting up VTL evaluations.
  • Update of the storage diagram to include purchased equipment and remove old equipment.


Activity Report June 2008

25 06 2008

This month we made a decision on a new tape library and have purchased it.  The new library is a Quantum I2k and should be here within a week.  In preparation for the arrival of the new hardware we have arranged installs of new HBAs in our existing storage nodes and developed a strategy for how the new device will be attached to our environment.  The addition of the new library to our backup environment will complete the life cycle replacement of our Sun STK L700 as well as give us a lot of additional capacity in our backup environment.  The main concern will be cloning speed and this should be greatly enhanced as well.  The addition of a VTL or similar technology will most likely be necessary to fully keep up with our cloning operations.

I have authored another Perl script which will keep track of all of our backup clients and act as the definitive list of clients configured in our backup environment.  This script launches every night at midnight and creates a master list of clients used for reporting purposes.  A second Perl script uses this master file to create three separate lists used for cloning.  Cloning is now done one third of our environment at a time over the course of a month.  I will most likely attempt to do the entire environment in a single week once the new library is in place.
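
The step that turns the master list into three cloning lists can be sketched as follows. This is an illustration in Python, not the Perl script described above. The round-robin split, the de-duplication, and the name `split_clients` are assumptions, since the actual splitting rule is not stated.

```python
def split_clients(clients, parts=3):
    """Split a master client list into `parts` roughly equal lists.

    Names are stripped, lower-cased, and de-duplicated first so a client
    cannot end up in more than one cloning list.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    unique = sorted({name.strip().lower() for name in clients if name.strip()})
    # Round-robin slicing keeps the list sizes within one client of each other.
    return [unique[i::parts] for i in range(parts)]


if __name__ == "__main__":
    master = ["web1", "db1", "mail1", "web1", "app1"]
    for number, group in enumerate(split_clients(master), start=1):
        print(f"clone list {number}: {group}")
```

Changing `parts` to 1 would correspond to cloning the whole environment in a single pass.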

Other than that it’s been a lot of operational work over this month.



May 2008 Activity Report

28 05 2008

I worked with EMC this past month to evaluate EBA. After using the product for that time we have decided that as the product currently stands it is not of much use to us. It gathers available logs from several pieces of the storage environment and rolls them up to a single pane for review. However, it consumed over 40% of the processor on our primary backup server. The functionality it provided was not worth the performance impact to our systems.

This past weekend I attempted to migrate a large remaining portion of our storage environment to the admin network. Unfortunately, despite information from EMC that only the RM server for Exchange needed to communicate with the Clariions, the RM application was broken by the move. We learned during this process that the Exchange servers still need to communicate with the Clariions as well. We will be working with the mail team to try to get the Exchange servers dual-homed and resolve an application issue preventing it.

I also developed scripts this month to pull CO-LO usage information from our Clariion arrays. This information was requested by management to help determine what ITS storage resources are dedicated to other departments.

Finally this month we reviewed several tape libraries for life cycle replacement of our current Sun StorageTek L700. We have made a decision and presented it to management.



April Activity Report

28 04 2008

I took the SNIA SCSP test this month and passed.

EBA Evaluation:

I have been working with EMC to conduct another evaluation of the EBA product. I have experienced several issues implementing it in our environment. The software was unable to monitor any of our network connections without a patch that they had to develop for us during this process. We have also been unable to monitor our fiber ports on the backup servers. It appears the cause of this problem is the absence of some SNIA libraries. Unfortunately, these libraries are included with vendor drivers for the HBAs and we currently use the Red Hat provided drivers built into the kernel instead of vendor drivers. I am reluctant to switch to vendor drivers now that we have resolved several SCSI issues we were having in the past within our tape library environment. We have also been unable to monitor our tape library with this product. The issue appears to be an SNMP problem and we are still trying to ascertain what is causing it. The final issue is that the product doesn’t appear to be giving me the level of detail I had hoped to see for problem identification and resolution. Our main reason for evaluating this product again was for the ability to identify, locate, and resolve issues within our backup environment. Perhaps we will see more of this type of utility once we get some of the other issues resolved.

BlueArc Performance issues:

We have seen a drastic change in our usage pattern on the BlueArc NAS appliance. We went from a max utilization of around 50% to being maxed out almost continuously for several days at a time. The culprit appears to be an increase in usage from the ACCRE cluster, but I am still working with the vendor to make sure this is the actual cause and not a system problem of some kind.

Backup Operations:

Finally, backup has been going pretty well this month. The numerous changes to the environment last month appear to have stabilized the application and the tape library maintenance has addressed the stability of our library. This month has been primarily spent addressing bottlenecks within our backup infrastructure. We also have two new servers on site to be configured as additional storage nodes.



March Activity Report 2008

25 03 2008

backup, Backup, BACKUP!!

Here is a list of changes made to our environment in an attempt to improve stability:

Problem: Networker application was locking up.

Cause: Clone job sets were too large.

Solution: All new clone scripts were written to change the methodology behind our clone operations.
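
One way to keep clone jobs from growing too large is to break the list of save sets into bounded batches and clone one batch per job. The sketch below shows that idea in Python. It is an assumption about what a smaller-job methodology could look like, not a description of the clone scripts themselves, and the name `batch_save_sets` and the batch size are hypothetical.

```python
def batch_save_sets(save_set_ids, max_per_job):
    """Split save set IDs into consecutive batches of at most max_per_job."""
    if max_per_job < 1:
        raise ValueError("max_per_job must be at least 1")
    ids = list(save_set_ids)
    # Each slice becomes one clone job; the last one may be shorter.
    return [ids[i:i + max_per_job] for i in range(0, len(ids), max_per_job)]


if __name__ == "__main__":
    pending = ["1001", "1002", "1003", "1004", "1005"]
    for job_number, batch in enumerate(batch_save_sets(pending, 2), start=1):
        print(f"clone job {job_number}: {' '.join(batch)}")
```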

Problem: Staging processes from the disk target in our backup environment would cause Networker to lock up.

Cause: The staging processes use cloning in the background and were creating clone jobs that were too large.

Solution: Adopted newly created cloning scripts to handle staging from disk devices.

Problem: Networker application was locking up.

Cause: Security scanner was scanning application ports.

Solution: Disable security scans for Networker application ports.

Problem: RPC timeouts would cause Networker to lose communication with the daemon processes responsible for tape drive access.

Cause: Possibly the clone set problem, or RPC tuning issues.

Solution: New clone scripts were written and RPC tunes applied.

Problem: Some data was not being cloned.

Cause: More full backup data was being generated daily than could be cloned.

Solution: Change in cloning policy implemented to only have one monthly copy offsite.

Problem: A tape drive that was working on the Networker server and the Celerra failed to eject properly when accessed from the Storage Node.

Cause: The kernel versions were different between the Networker server and Storage Node. The SCSI drivers are included in the kernel.

Solution: We upgraded the kernel on both machines to the latest version.

Problem: Clients were getting an RPC timeout during backup operations.

Cause: Service port would exit before transmission was completed on the data ports.

Solution: Timeout values were adjusted at the group level in the Networker group configurations.

Problem: Clients with multiple IP addresses were failing to back up.

Cause: Clients had been configured with the public name and Networker was confused as to the client’s identity.

Solution: Clients were destroyed and recreated using the client identifier to maintain save sets.

Problem: Admin network backup traffic was traveling over the public network if the storage node was the secondary backup server.

Cause: Networker identified the storage node via its public name.

Solution: A client side host entry was created to point the public name at the private IP address.

Problem: NDMP file level restore operations did not function.

Cause: Authorization had not been granted at the Networker client configuration level.

Solution: Authorization was granted.

Problem: pre_clone.sh script was being run every time a save group completed.

Cause: Script was initially used to initiate clone operations. Old notification rule set was causing the script to be executed every time a save group completed.

Solution: Notification rule was removed.

Problem: Several unusable tape drive resources were being shown as available in the Networker application.

Cause: Networker scans the SCSI bus to include all devices it can see. Invalid drives were an artifact created by the udev process on our Linux backup servers.

Solution: Each device was manually configured and removed.

Problem: Volume identifiers were being corrupted in Networker.

Cause: Scanner on robot arm of the library was not correctly reading labels.

Solution: Scanner was replaced.

Problem: Tape drives would automatically disable (not a SCSI reset).

Cause: Tape drives had physical defects.

Solution: Drives are being replaced as needed.

Problem: Reboot of Networker server causes the jukebox to jump SCSI addresses and results in having to rebuild the jukebox resource.

Cause: Linux only supports persistent binding at the device file level. Networker only recognizes the robot arm at the SCSI address level.

Solution (Mitigated): New procedure to change SCSI control port “on the fly” implemented.

Finally, I also attended a SNIA class and learned quite a bit. The SNIA material is all based on ITIL principles and will be a good source of information. I will be taking the SNIA Certified Professional test in a couple of weeks.