Activity Report August 2008

25 08 2008
  • We installed 2 new disk array enclosures and 2 new fiber channel switches into the BlueArc environment to expand existing capacity.
  • I optimized the cloning scripts for the i2k tape library data in order to speed up cloning.  Cloning is now organized so that each thread clones all of the data available on a single tape before moving on to the next tape.  This should reduce the time that loading and unloading tapes adds to the cloning procedure.
  • We’ve been plagued by a couple of issues with our tape drives on the new i2k.  A firmware upgrade appears to have fixed them.
  • We are continuing to work on the VTL evaluation.  The hardware install has been completed.  We are now coordinating with EMC to get software resources to complete configuration so that we can test the VTL in our environment.
  • I have continued to investigate clients with failing backups in our environment and to determine the causes, so that we have as complete a backup of all of our systems as possible.
  • Operational activities have consumed the remainder of my time.
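The per-tape cloning order described above can be sketched roughly as follows. This is only an illustration in Python, not the actual cloning scripts: the save set IDs, the tape labels, and the `fake_clone` stand-in are all hypothetical, and a real script would call the backup software where `fake_clone` is used.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_tape(save_sets):
    """Map each tape label to the list of save set IDs stored on it."""
    by_tape = defaultdict(list)
    for ssid, tape in save_sets:
        by_tape[tape].append(ssid)
    return dict(by_tape)


def clone_tape(tape, ssids, clone_one):
    """Clone every save set on one tape before the worker moves on."""
    return tape, [clone_one(tape, ssid) for ssid in ssids]


def clone_all(save_sets, clone_one, threads=2):
    """Hand each worker thread one whole tape at a time."""
    by_tape = group_by_tape(save_sets)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(clone_tape, tape, ssids, clone_one)
                   for tape, ssids in by_tape.items()]
        return dict(f.result() for f in futures)


# Hypothetical (save set ID, tape label) pairs, for illustration only.
SAMPLE = [
    ("1001", "A00001"),
    ("1002", "A00002"),
    ("1003", "A00001"),
    ("1004", "A00002"),
    ("1005", "A00003"),
]


def fake_clone(tape, ssid):
    """Stand-in for the real clone command (assumption, not a real API)."""
    return f"cloned {ssid} from {tape}"
```

Calling `clone_all(SAMPLE, fake_clone)` processes three tapes instead of five individual save sets, so in this sketch each tape would only need to be loaded once.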


Activity Report July 2008

25 07 2008

This month saw many changes to our storage environment. Here is a list:

  • Six 146 GB Fiber Channel DAEs installed in the Hill Center CX3-80.
  • One 1TB SATA DAE installed in the Hill Center CX3-80.
  • One 1TB SATA DAE installed in the Stevenson Center CX3-40.
  • CAVA upgraded on the two virus check servers for the Celerra.
  • FLARE 26 patches applied to our Clariion service processors to address a bug.
  • New Quantum I2000 library installed into our backup environment.
  • BlueArc request pages added to the storage site to facilitate administrative requests from other departments.

In addition to these changes I have worked on the following:

  • Update of backup cloning scripts to support the I2k Library.
  • Setting up VTL evaluations.
  • Update of the storage diagram to include newly purchased equipment and remove old equipment.


Activity Report June 2008

25 06 2008

This month we made a decision on a new tape library and have purchased it.  The new library is a Quantum i2k and should be here within a week.  In preparation for the arrival of the new hardware we have arranged installs of new HBAs in our existing storage nodes and developed a strategy for how the new device will be attached to our environment.  The addition of the new library to our backup environment will complete the life cycle replacement of our Sun STK L700 as well as give us a lot of additional capacity in our backup environment.  The main concern will be cloning speed, and this should be greatly enhanced as well.  The addition of a VTL or similar technology will most likely be necessary to fully keep up with our cloning operations.

I have authored another Perl script which will keep track of all of our backup clients and act as the definitive list of clients configured in our backup environment.  This script launches every night at midnight and creates a master list of clients used for reporting purposes.  A second Perl script uses this master file to create three separate lists used for cloning.  Cloning is now done for one third of our environment at a time over the course of a month.  I will most likely attempt to do the entire environment in a single week once the new library is in place.
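As a rough illustration of the second step, here is a minimal Python sketch of turning one master list into three cloning lists. The real scripts are in Perl and how they divide the clients is not described here, so the round-robin split, the function names, and the sample host names are all assumptions.

```python
def load_master(lines):
    """Return the unique, non-blank client names in sorted order."""
    return sorted({line.strip() for line in lines if line.strip()})


def split_for_cloning(clients, parts=3):
    """Deal the clients out round-robin into `parts` lists.

    Round-robin is an assumption; it simply keeps the lists close
    to the same length.
    """
    return [clients[i::parts] for i in range(parts)]


# Hypothetical master file contents, including a blank line and a duplicate.
master = load_master(["db1\n", "web2\n", "\n", "app3\n", "db1\n", "mail4\n", "file5\n"])
groups = split_for_cloning(master)
```

Each of the three resulting lists could then be cloned in its own part of the month.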

Other than that it’s been a lot of operational work over this month.



May 2008 Activity Report

28 05 2008

I worked with EMC this past month to evaluate EBA. After using the product for that time we have decided that, as the product currently stands, it is not of much use to us. It gathers available logs from several pieces of the storage environment and rolls them up to a single pane for review. However, it consumed over 40% of the processor on our primary backup server. The functionality it provided was not worth the performance impact to our systems.

This past weekend I attempted to migrate a large remaining portion of our storage environment to the admin network. Unfortunately, despite information from EMC that only the RM server for Exchange needed to communicate with the Clariions, the RM application was broken by the move. We learned during this process that the Exchange servers still need to communicate with the Clariions as well. We will be working with the mail team to try to get the Exchange servers dual-homed and to resolve an application issue preventing it.

I also developed scripts this month to pull CO-LO usage information from our Clariion arrays. This information was requested by management to help determine what ITS storage resources are dedicated to other departments.

Finally, this month we reviewed several tape libraries for life cycle replacement of our current Sun StorageTek L700. We have made a decision and presented it to management.



April Activity Report

28 04 2008

I took the SNIA SCSP test this month and passed.

EBA Evaluation:

I have been working with EMC to conduct another evaluation of the EBA product. I have experienced several issues implementing it in our environment. The software was unable to monitor any of our network connections without a patch that they had to develop for us during this process. We have also been unable to monitor our fiber ports on the backup servers. It appears the cause of this problem is the absence of some SNIA libraries. Unfortunately, these libraries are included with vendor drivers for the HBAs, and we currently use the Red Hat provided drivers built into the kernel instead of vendor drivers. I am reluctant to switch to vendor drivers now that we have resolved several SCSI issues we were having in the past within our tape library environment. We have also been unable to monitor our tape library with this product. The issue appears to be an SNMP problem and we are still trying to ascertain what is causing it. The final issue is that the product doesn’t appear to be giving me the level of detail I had hoped to see for problem identification and resolution. Our main reason for evaluating this product again was for the ability to identify, locate, and resolve issues within our backup environment. Perhaps we will see more of this type of utility once we get some of the other issues resolved.

BlueArc Performance issues:

We have seen a drastic change in our usage pattern on the BlueArc NAS appliance. We went from a max utilization of around 50% to being maxed out almost continuously for several days at a time. The culprit appears to be an increase in usage from the ACCRE cluster, but I am still working with the vendor to make sure this is the actual cause and not a system problem of some kind.

Backup Operations:

Finally, backup has been going pretty well this month. The numerous changes to the environment last month appear to have stabilized the application, and the tape library maintenance has addressed the stability of our library. This month has been primarily spent addressing bottlenecks within our backup infrastructure. We also have two new servers on site to be configured as additional storage nodes.



March Activity Report 2008

25 03 2008

backup, Backup, BACKUP!!

Here is a list of changes made to our environment in an attempt to improve stability:

Problem: Networker application was locking up.

Cause: Clone job sets were too large.

Solution: All new clone scripts were written to change the methodology behind our clone operations.

Problem: Staging processes from the disk target in our backup environment would cause Networker to lock up.

Cause: The staging processes use cloning in the background and were creating clone jobs that were too large.

Solution: Adopted newly created cloning scripts to handle staging from disk devices.

Problem: Networker application was locking up.

Cause: Security scanner was scanning application ports.

Solution: Disable security scans for Networker application ports.

Problem: RPC timeouts would cause Networker to lose communication with the daemon processes responsible for tape drive access.

Cause: Possibly the clone set problem, or RPC tuning issues.

Solution: New clone scripts were written and RPC tunes applied.

Problem: Some data was not being cloned.

Cause: More full backup data was being generated daily than what could be cloned.

Solution: Change in cloning policy implemented to only have one monthly copy offsite.

Problem: A tape drive that was working on the Networker server and the Celerra failed to eject properly when accessed from the Storage Node.

Cause: The kernel versions were different between the Networker server and the Storage Node. The SCSI drivers are included in the kernel.

Solution: We upgraded the kernel on both machines to the latest version.

Problem: Clients were getting an RPC timeout during backup operations.

Cause: Service port would exit before transmission was completed on the data ports.

Solution: Timeout values were adjusted at the group level in the Networker group configurations.

Problem: Clients with multiple IP addresses were failing to back up.

Cause: Clients had been configured with the public name and Networker was confused as to the client’s identity.

Solution: Clients were destroyed and recreated using the client identifier to maintain save sets.

Problem: Admin network backup traffic was traveling over the public network if the storage node was the secondary backup server.

Cause: Networker identified the storage node via its public name.

Solution: A client side host entry was created to point the public name at the private IP address.

Problem: NDMP file level restore operations did not function.

Cause: Authorization had not been granted at the Networker client configuration level.

Solution: Authorization was granted.

Problem: pre_clone.sh script was being run every time a save group completed.

Cause: Script was initially used to initiate clone operations. Old notification rule set was causing the script to be executed every time a save group completed.

Solution: Notification rule was removed.

Problem: Several unusable tape drive resources were being shown as available in the Networker application.

Cause: Networker scans the SCSI bus to include all devices it can see. Invalid drives were an artifact created by the udev process on our Linux backup servers.

Solution: Each device was manually configured and removed.

Problem: Volume identifiers were being corrupted in Networker.

Cause: Scanner on robot arm of the library was not correctly reading labels.

Solution: Scanner was replaced.

Problem: Tape drives would automatically disable (not due to a SCSI reset).

Cause: Tape drives had physical defects.

Solution: Drives are being replaced as needed.

Problem: Reboot of Networker server causes the jukebox to jump SCSI addresses and results in having to rebuild the jukebox resource.

Cause: Linux only supports persistent binding at the device file level. Networker only recognizes the robot arm at the SCSI address level.

Solution (Mitigated): New procedure to change SCSI control port “on the fly” implemented.

Finally, I also went to a SNIA class and learned quite a bit. The SNIA material is all based on ITIL principles and will be a good source of information. I will be taking the SNIA Certified Professional test in a couple of weeks.



February 2008 Activity Report

25 02 2008

Projects:

Vuspace 3:

We’ve made our recommendations and we are waiting for guidance.

Operational:

Backup, backup, backup, and more backup.  I’ve learned a TON about our environment over this month.  I’ve had to do a lot of troubleshooting and have managed to resolve some issues with clients that have been around for a while.  In addition to the day-to-day troubleshooting necessary to keep the environment running smoothly, I had to completely rebuild the jukebox on our backup server the other evening.

I migrated a DAE from Stevenson to the Hill Center to increase our capacity of Tier 2 storage at the Hill Center.  I also installed a new Tier 3 DAE into the Hill Center CX3-80 to facilitate a 3 TB expansion to the Celerra NAS device.

Training:  I am in training this week studying Networker 7.3, which is the software we use to back up our environment.



January 2008 Activity Report

28 01 2008

Vuspace:

  • We presented two solutions: Xythos and a Microsoft Windows cluster. We’re currently waiting on governance to see which path we head down.

Storage Management IP Migration:

  • I am in the process of moving all of the management interfaces of our Storage environment to our admin network. This month the SAN directors were all moved successfully. Next we’ll move the Clariions and a couple of other interconnected machines. I am waiting on the Exchange servers.

Operational:

  • The BlueArc suffered a parity error this month in one of the drives in a RAID 5 parity set. The apparent cause was corrupt data written to the disk when a failed drive was replaced. The final fix was to find the specific disk the faulty parity data was written to and replace that disk with a new one. The rebuild from the other disks regenerated the parity data, fixing the issue. Our exposure during this time was possible data loss if any drive other than the one with the corrupted parity data had failed.
  • The HBAs have been swapped out from Emulex to QLogic cards in an attempt to fix the SCSI LUN reset errors that were causing drives to be automatically set offline. A reboot was required to repair this issue. Once since the upgrade we have had tape drives automatically disable, but it was not due to a SCSI reset. We have not seen another reset since. We have identified and replaced a failed tape drive that was causing issues.
  • Routine Clariion maintenance.
  • Routine SAN maintenance.
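To show why replacing the disk holding the bad parity fixed the BlueArc issue, and why any other drive failing first would have meant data loss, here is a small Python sketch of XOR parity as used by RAID 5. It is a simplification built on assumptions: it uses one stripe with made-up two-byte blocks and a fixed parity position, whereas a real array rotates parity across the disks and works on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, missing):
    """Rebuild the block at index `missing` from every other block."""
    return xor_blocks([blk for i, blk in enumerate(stripe) if i != missing])


# Made-up data blocks for one stripe; the last block of the stripe is parity.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)
good_stripe = data + [parity]

# The same stripe with corrupted parity, as in the incident described above.
bad_stripe = data + [b"\x00\x00"]

# Losing a data disk while parity is bad gives back the wrong data.
wrong = rebuild(bad_stripe, 1)

# Replacing the disk that holds the bad parity regenerates it from the data.
regenerated = rebuild(bad_stripe, 3)
```

In this sketch `wrong` differs from the original block, while `regenerated` matches the correct parity, which mirrors the fix and the exposure described in the first bullet.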



December 2007 Activity Report

28 12 2007



November Activity Report

26 11 2007

Projects:

Vuspace 3:

In addition to the three products we are currently considering (Documentum, Xythos, and Windows Cluster), we evaluated a number of free solutions this month. We have created a feature matrix of all the products we have considered and hope to present it for final resolution sometime soon.

Operational:

  • We upgraded the firmware to the latest versions in our SAN environment this month. The upgrade of our first 9509 SAN director caused a reboot that was not expected. Luckily, we have a redundant SAN fabric in the Hill Center and no serious consequences were seen as a result. The upgrade of the second 9509 SAN director went much better. The upgrade went as expected and no outage was seen as a result on that switch. The 9216 upgrade in Stevenson went as expected, and we had no unintended impacts from the upgrade due to help from the ESX team.
  • This month we added an IBM Blade Center to our environment. The blade center included an internal Cisco SAN switch module. We ISL’d it into our environment and presented storage through it to the blade center for evaluation.
  • I was tasked with writing a script this month to gather IMAP statistics. The request was for mail messages < 90 days, 90-120 days, and > 120 days old. The total size on disk for each category as well as a tally was needed. I used Perl and wrote a recursive subroutine to handle the arbitrary file structures possible within mailboxes. The whole thing worked pretty slick; all that was needed was a few additional code blocks to handle special cases such as special characters. One of the most difficult to address was mailbox sub-folders with a single space character (‘ ’) as the full name of the sub-folder.
  • In the RMSE environment this month I tested and then Tony implemented the Admsnap upgrade that was the last hurdle before we could perform the Clariion firmware upgrades. The RMSE environment should now be compatible with FLARE 24 making our coming upgrade after the first of the year possible.
  • The rest of the month was spent troubleshooting various issues on the BlueArc and doing routine work connecting new hosts to our Storage environment. We had multiple new ESX hosts connected to our storage environment this month. The ESX environment is now using storage from 2 frames and is currently the fastest-growing storage consumer in our environment.
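The IMAP statistics script mentioned above was written in Perl. As a rough idea of the approach, here is a minimal Python sketch of a recursive walk that tallies messages and bytes per age category. It assumes one message per file and uses the file modification time as the message age; both are assumptions, and the function names are my own for illustration.

```python
import os

DAY = 86400  # seconds in a day


def bucket_for(age_days):
    """Pick the age category for a message."""
    if age_days < 90:
        return "<90"
    if age_days <= 120:
        return "90-120"
    return ">120"


def scan(path, now, stats=None):
    """Recursively tally message count and bytes per age category."""
    if stats is None:
        stats = {name: {"count": 0, "bytes": 0}
                 for name in ("<90", "90-120", ">120")}
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Sub-folder names are never parsed, so a folder named
                # with a single space is handled like any other.
                scan(entry.path, now, stats)
            elif entry.is_file(follow_symlinks=False):
                info = entry.stat(follow_symlinks=False)
                age_days = (now - info.st_mtime) / DAY
                bucket = stats[bucket_for(age_days)]
                bucket["count"] += 1
                bucket["bytes"] += info.st_size
    return stats
```

Calling `scan(mailbox_root, time.time())` with a hypothetical `mailbox_root` path returns the tally and total size for each of the three categories.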