Once again, my main focus for the month has been DNS with a healthy dose of DHCP.
- The Diamond IP test environment was finally successfully upgraded to 3.0.53. This will allow us to move forward with the NetID import process and get ready for the DHCP rollout.
- The Prerequisites, Phase I, and Phase II steps for production migration were successfully completed. We continue to run through the process to iron out any possible "gotchas" that may appear.
- OMAPI was successfully implemented and works with Diamond IP.
- DNS Views testing is halted while we finalize DHCP testing. The repeated and rapid database refreshes make it extremely difficult to make any meaningful progress before being reset to 0.
DNS took an interesting turn this month as we finally started to do some analysis of the traffic and performance of the servers.
- What started out as curiosity has become an automated process: performance reporting on the name servers. Daily total query counts and the top query sources are reported to management. Looking through this information has allowed us to identify, sometimes with very surprising results, where the majority of the DNS traffic originates. It has also opened our eyes to some acceptable-usage concerns.
- Next on the list is the identification of top requested domains and to automate that report as well.
- Named stats are FINALLY being collected and are almost ripe for graphing and analysis. This will give us a good idea of how the server is performing at a more granular level.
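For the curious, here is a minimal sketch of how the daily total and the top query sources could be pulled out of a BIND query log. This is not the actual reporting script. The function name `top_query_sources` and the sample log lines are made up for illustration, and the regular expression assumes the usual `client <ip>#<port>` layout of BIND 9 query-log lines (newer releases add an `@0x...` pointer after `client`, which the pattern tolerates).

```python
import re
from collections import Counter

# Matches the client address in a BIND query-log line, with or without the
# "@0x..." object pointer that newer BIND 9 releases print after "client".
CLIENT_RE = re.compile(r"client (?:@0x[0-9a-fA-F]+ )?([0-9a-fA-F.:]+)#\d+")


def top_query_sources(lines, n=10):
    """Return (total_queries, [(client_ip, count), ...]) for the top n clients."""
    counts = Counter()
    for line in lines:
        if " query: " not in line:
            continue  # skip anything that is not a query-log entry
        match = CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return sum(counts.values()), counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample lines; a real run would read the log file instead.
    sample = [
        "client 10.1.2.3#4242: query: www.example.com IN A +",
        "client 10.1.2.3#4243: query: mail.example.com IN MX +",
        "client 10.9.9.9#5353: query: www.example.com IN A +",
    ]
    total, top = top_query_sources(sample, n=2)
    print(total, top)
```

Counting by client address only is a deliberate simplification; the same loop could just as easily tally the queried name to produce the top requested domains report.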
Not to be stuck feeling left alone, our ESX virtualization environment decided to rear its head and demand some attention.
- The final fault is our own, but a networking hiccup uncovered some serious design/implementation flaws in our ESX environment. The hiccup caused a cascade of network outages which made the ESX servers think they were in an isolated state. High Availability did what it was supposed to and attempted to initiate migrations of the VMs to other servers and to kill the running VM processes. Except… almost every ESX host thought the same thing at the same time, resulting in some VMs not moving, some moving repeatedly, some shutting down, and others just getting into a confused state.
- Fault lies not with the software or even the network, but in a lack of attention to detail: we failed to ensure that our virtual switches were redundantly connected to separate physical switches and that all VLANs were correctly tagged to their associated physical ports.
- Identification and repair of the situation continues.
Other miscellaneous tasks filled up the rest of the time.
- Power/rack moves during the weekends
- Operational duties
- Patched some of the standby and test/dev database servers.
Next month is patch-happy heaven and a massive ramp-up to DHCP migration.
1. DIP Test Environment – It’s up and running. DHCP Migration moving forward
2. BIND Views in DIP – Initial attempt to import a working BIND views configuration failed with tons of Java errors. Trying another method.
3. Moved a bunch of servers to new rack locations.
4. Fixed various maintenance issues with list-srv1, news-srv1, Napster, and other servers.
5. Patched the DNS vulnerability
6. Filled in for PW on IDM
Primary DNS is now pumping out 28,500,000 queries in a 24 hour period.
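To put that number in perspective, here is a quick back-of-the-envelope conversion to an average per-second rate. The helper name `queries_per_second` is just for illustration, and the result is an average over the whole day; it says nothing about peak load.

```python
def queries_per_second(total_queries, hours):
    """Average query rate over a reporting window given in hours."""
    return total_queries / (hours * 3600)


# 28,500,000 queries in 24 hours works out to roughly 330 queries per second.
print(round(queries_per_second(28_500_000, 24)))
```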
1. BT INS aka DiamondIP aka DIP – Test environment still not up. Having multiple issues getting the database to work on the test environment. Half tempted to blow the thing away and totally rebuild this test environment.
2. Auth4 aka redundant kerb server – Built out a new kerb server to go to the hospital.
3. Gave the Med Center an overview of DIP and the functionality. They were made aware of the ups and downs of the product as well as given an insight on how it works.
4. Finally got CA to get me uncorrupted patches for Spectrum One-Click. Installed the patches and everyone is now happy that Report Manager works as it should.
5. Lent a hand on a little forensic work for IRT. Ended up taking much more time than I thought it would but was very much an eye-opening experience.
6. Fixed a ton of issues on RHN. Purged a lot of non-responding servers, registered a bunch more, and finally got all the channels resync’d.
7. Did a metric boatload (as opposed to Imperial or Standard boatloads) of DNS requests/changes/fixes/etc
8. Managed to break DNS attempting to get BIND Views working…. GO ME!
May MARS
1. Moved the VCMS database off of the VCMS server to the Linux Oracle server cluster – Instead of timing out when attempting to do any performance reports greater than 3 days, we can now query and receive our reports for the past year in less than 10 seconds. This is extremely helpful in doing trend analysis of our virtual environment.
2. Cloned and moved the NDE server to Stevenson – The NDE webserver was cloned and moved to the SC datacenter cluster to provide redundancy of services. Now working on getting rsync over ssh working to ensure data is correctly replicated.
3. Spectrum patching – I attempted to patch our Spectrum environment to get the Business Objects Report Manager working correctly but one of the patches was corrupted. This prevented me from finishing the patch process. The vendor has yet to supply the patch again for download.
4. Moved and rebuilt the Diamond IP test application server – The test application server was moved and rebuilt. The production data was loaded on the test database and I am in the process of removing the pointers to our production DNS and pointing them towards the test application/DNS service.
5. Develop automated Virtual Environment Billing Script – Started work on writing a script that will automatically gather the billable data (procs and RAM amounts) from the VMX files for co-located VMs. Currently fighting what I call "Special Character Hell" to get around the multitudes of parentheses, dashes, spaces, and slashes in VM names/directories.
6. Continued work/enhancement of Diamond IP Production environment - Work continues on the production DIP environment to ensure stability and to get external LDAP authentication working. Additionally, RFC1918 DNS preparation continues.
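As an aside on item 5, one way to sidestep "Special Character Hell" is to never pass the VM paths through a shell at all. The sketch below is not the billing script itself. It assumes the VMX files carry the usual `numvcpus`, `memsize`, and `displayName` keys in `key = "value"` form, and it assumes a default of one vCPU when `numvcpus` is absent. The function names and the directory layout are hypothetical.

```python
import re
from pathlib import Path

# A VMX file is plain text made of lines shaped like: key = "value"
VMX_LINE_RE = re.compile(r'^\s*([A-Za-z0-9_.:]+)\s*=\s*"(.*)"\s*$')


def read_vmx(path):
    """Parse one VMX file into a dict with lower-cased keys."""
    settings = {}
    for line in Path(path).read_text(errors="replace").splitlines():
        match = VMX_LINE_RE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def billable_rows(root):
    """Yield (vm_name, vcpus, memory_mb) for every .vmx file under root."""
    for vmx in sorted(Path(root).rglob("*.vmx")):
        settings = read_vmx(vmx)
        name = settings.get("displayname", vmx.stem)
        vcpus = int(settings.get("numvcpus", "1"))  # assumed default of 1
        memory_mb = int(settings.get("memsize", "0"))
        yield name, vcpus, memory_mb
```

Because `pathlib` hands the file names straight to the operating system, parentheses, dashes, and spaces in VM names need no quoting or escaping.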
Well… where to start….
I am proud to say that this should be the last month where I claim my main priority was DNS.
1. BT INS Diamond IP successfully deployed – After some false starts, some anger, some frustration, and a whole lot of fatigue, we FINALLY rolled out our replacement DNS architecture. The new system is running a fully integrated DNS/DHCP/IP Management solution and went in with Zero Downtime. Out of over 29000 records and over 430 separate domains, I have received word of only 6 individual resource record errors. Not too shabby. Currently, we are serving up over 1.5 million queries an hour without a hiccup and the transition was transparent to the community. Now I get to focus on getting BIND Views up and running and getting co-workers trained up.
2. VMware Certified Professional – Yeah… it fell through the cracks and I got gigged on failing to take it prior to my review… I’ll accept that. I will also accept that I took the test and PASSED. Now, I need to figure out which additional alphabet soups I can append to my title… (RHCE, VCP, AEIOU, etc).
3. The Solaris Oracle environment was patched up to current revs. Of course, during one of the patch sessions, SunSolve decided to send its bandwidth out to lunch and make a 4 hour patch cycle take almost 10 hours. Thanks Sun! Also, DB-1 and DB-2 received some additional space so we don’t have to get called every time a backup of the database is kicked off.
4. I really need to tidy up some of my operational tasks/duties now that DNS is (mostly) done.
I would like to take this time to bid farewell to Kenon Ewing as he heads over to the storage team. He will continue to excel and his presence will be sorely missed.
Feb MARS:
Once again, it’s all been about DNS and the preparation for the March 2nd deployment…
* Cleaned DNS zone files, which led to a reduction from 10367 lines in the main vanderbilt.edu zone to 4271 lines.
* Created vanderbilt.edu sub-zones to facilitate self-serve requests for major users. This resulted in another reduction of the vanderbilt.edu zone to 1617 lines and the creation of 18 individual sub-zones.
For the record, the two activities above involved manual line-by-line combing through the files. Talk about time-consuming and tedious…
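In hindsight, a small helper could at least have pointed out which subdomains were worth splitting off. The following is only a sketch of that idea, not what was actually done (that was all by hand). The function `subzone_candidates` and its `min_records` threshold are hypothetical, and it expects a list of owner names that have already been pulled out of the zone file.

```python
from collections import Counter


def subzone_candidates(owner_names, zone="vanderbilt.edu", min_records=2):
    """Count records under each first-level label below the zone apex."""
    suffix = "." + zone.lower().rstrip(".")
    counts = Counter()
    for name in owner_names:
        fqdn = name.lower().rstrip(".")
        if not fqdn.endswith(suffix):
            continue  # the apex itself, or a name outside this zone
        labels = fqdn[: -len(suffix)].split(".")
        if len(labels) >= 2:
            counts[labels[-1]] += 1  # e.g. "www.math" counts toward "math"
    return [(label, n) for label, n in counts.most_common() if n >= min_records]
```

Names that sit directly under the apex (a single label) are ignored on purpose, since they would stay in the parent zone anyway.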
* Install and configure the BT INS Diamond IP appliances: Not as easy as it sounds. The initial build shipped with the appliances had a small error that would not allow the management stations to install from the provided USB keys. After much hair pulling, the vendor overnighted a new build for us to use for the install. Additionally, being locked into the non-privileged accounts has led to much heartache when it comes to iptables, routing, and other low-level configurations.
Enough about DNS… I actually do other things too… seriously…. stop laughing now….
* Continued to work on getting Spectrum updated. With the deployment of 8.1, multiple processes managed to become broken. After multiple hot fixes and patches were installed, all but one seem to be resolved. The remaining issue concerns a bug with Business Objects and the Report Manager function for Spectrum. With any of the Sun X-Server patches installed, BO decides to go blah. By blah, I mean fail to work. Computer Associates (the vendor responsible for Spectrum) is waiting on Business Objects to provide some sort of patch for this issue.
* The only other notable activity for the month (I don’t include operational gunk because it’s… well… mundane operational gunk) was the re-examination of system data collection on the servers I am responsible for maintaining. Sar, for all of its pains and annoyances, is a steady standby for collecting system data. SNMP and Cacti and Nagios and blah blah blah are all well and good, but relying on them exposes us to potential loss of granular data. So, after spending a couple of days jumping through hoops to provide CPU utilization data on systems and finding that it was (for lack of a better term) lacking, I have gone around and ensured sar is doing its thing.
Other than that, all I have to say is "HOLY COW FEBRUARY WENT FAST!" and "HOLY COW DNS IS ALMOST HERE!"
YAY! It’s that time again!
Once again, the mantra is Dee Inn Ess with Dee Ech See Pee! Go, go Diamond IP!
* Spec’ed out the Oracle servers for DNS/DHCP. The servers will be beefy enough to hold multiple databases in addition to the DIP database. Additionally, we will FINALLY have a test/dev database environment when all is said and done. Amazing concept.
* Got my Visio groove on and knocked out the engineering diagrams for the DIP environment.
* Performed hardware maintenance on a couple of Sun servers. v210s are the devil when it comes to fans failing in their PSUs, and curse Sun for not making them modular (or even N+1).
* Attended Sharepoint training to learn more than I ever wanted or needed to know about how to create Sharepoint sites.
* Performed metric gathering and reporting for Owen School of Management to determine proper sizing of virtual machine CPU/Memory allocation for their spamgates.
* Upgraded virtual memory & CPU allocation for the Sharepoint environment, jump servers, and Owen School of Management.
* Updated the Shibboleth certificate
This month can be summed up in 3 letters… DNS
* Moved forward towards implementation of Diamond IP for consolidated IPAM/DNS/DHCP. Attended training (worthwhile, if painful) and now have a more complete idea of what we will need to do concerning migration, implementation, and support. DDNS and BIND9 views will still present the largest challenges during rollout.
* After much head pounding, replaced a NIC in an ESX server that caused a PCI reset which rebooted the server.
* Fought the good fight with the Software Store’s attempt to roll out the new RPEG utility on Linux using Tomcat. Too bad the vendor was 110% Windows-oriented and even admitted that any customer using Linux was basically on their own.
* Worked on the RHN Satellite upgrade. Need to get the LDAP authentication working and all will be fine there.
1. Patching of Production Solaris Servers – Phase 1
1/2 of the production Solaris servers were taken offline and patched with the latest R&S patches from Sun.
2. Patching of Test/Dev Solaris Servers
The test and development servers were patched with the latest R&S patches from Sun. This occurred during the day with no impact.
3. Compilation of ESX patches
The patches required for ESX patching have been compiled and staged on the ESX servers.
4. ESX HBA Failover Script
I researched and wrote a little script to help facilitate maintenance on the ESX servers and the SAN. The three-part script will manually fail over the HBA paths and fix them in place so there will be no loss of data connectivity when taking down 1/2 the legs.
5. Perl, OpenSSL, and the evils of the Crypt::SSLeay module
After beating my head into a wall trying to get Crypt::SSLeay to install, I ended up having to dig through the module to find out what was causing it to fail. The module was not passing the SSL location correctly and I had to manually fix each instance.
6. DiamondIP and the Never Ending DNS/DHCP Enhancement Project
With Infoblox being axed due to GUI limitations, we decided to evaluate the solution Nortel is moving forward with to replace NetID. DiamondIP is a web-based, ISC compliant DNS/DHCP management solution that shows promise (as did Infoblox). The conversion process took longer than expected but, finally, we have a somewhat finalized database of all of the zones and DHCP data. We are moving forward into testing with the hope that this solution may finally give everyone what they need.
7. The ITS/Vanderbilt Voice Collaboration Software Service … uh… Teamspeak Server
ITS installed a Teamspeak server to help facilitate remote classroom interaction. It is currently in testing and a maintenance/support plan still needs to be worked out.
1. LDAP1 and AUTH1 physical migration – ldap1 and auth1 were shut down during maintenance hours to perform a physical migration to a new rack. This freed up rack space to allow the data center to reclaim a rack. Despite a couple of hiccups in cabling, the move was successful.
2. VMware ESX security vulnerability patching – A couple of security vulnerabilities were discovered in VMware ESX Server 3.0.x. The vulnerabilities were assessed and the patches were applied to 2 test servers to ensure there would be no induced instabilities. After testing, the patches will be rolled out into the production environment.
3. RHEL4 SCSI driver patch – A bug was identified in the BusLogic SCSI driver in RHEL4-U3/4 when running in a virtual environment. The bug would make the OS remount the root filesystem in read-only mode if there was a SCSI timeout, so a single hiccup in the SAN environment would affect all the machines. VMware released an RPM patch to address the situation. The patch was applied to all RHEL4-U3/4 virtual machines during maintenance hours with no incident.
4. Infoblox DNS/DHCP Enhancement Project – Progress was made in the DNS/DHCP Enhancement Project with the go-ahead to take the project to governance. The DNS portion is extremely stable and is pretty much good to go. There was a hiccup encountered (explained below) that caused a slight delay, but it was overcome and forward movement was not significantly impeded. As it stands now, there are certain limitations in the GUI, as well as with importing the DHCP/IPAM data, that are preventing us from moving completely forward. Infoblox is addressing those issues.
5. LDAPB daemon issues – Multiple times in the month of April, LDAPB would stop performing authentication. This would result in degradation of service for some applications and some occasional embarrassing situations. For some reason that is yet to be determined with 100% authority, the slapd daemon continues to consume resources until it no longer passes on authentication requests. The level of resource usage is well below the maximums that the box can handle. Additionally, this behavior is not present on LDAPA. While the temporary fix is relatively simple (restart slapd) and quick (about 30 seconds), with current monitoring measures, we are not aware of the issue until the community notices. Currently, I am looking into a more proactive monitoring solution that will alert us much earlier. Unfortunately, there remains an issue with some clients/applications that hold persistent binds: they are unaware of the change in LDAP state and will fail once slapd is restarted (if they were bound to LDAPB).
6. Install perl modules for ldap-dev – I-Dev identified a number of perl modules that needed to be installed on ldap-dev. All the modules were successfully installed with the exception of Crypt::SSLeay, which is having issues with the latest version of OpenSSL.
7. MX change for math.v.e – Performed the MX change for the math subdomain to bring them to the mailgate/proofpoint environment.
8. Investigate vendors and procure SCSI drives for med center oracle netid replacement server – With the NetID environment split with the Medical Center on the horizon, additional drives needed to be purchased for the nidora1 replacement server. Multiple vendors were contacted and quotes received. The decision was made to move with Strategic Technologies for speed of delivery. The drives were purchased and installed in the replacement server for the Medical Center.
9. NetID Environment Split – The decision was made to split the NetID environment between the University and the Medical Center. Due to the highly sensitive nature of this environment and the potential for an outage, multiple courses of action were discussed before finally deciding on building additional Oracle servers and splitting the application servers to minimize downtime and the potential for data loss. Julie Catellier was outstanding in taking the lead on this project and developing the plan.
10. Infoblox DNS Import Issue – Importation of DNS data into the Infoblox environment resulted in some odd errors that required the attention of Infoblox support. It appeared that the data was imported but with undefined error codes spaced throughout the process. Through multiple runs/imports, the situation could not be replicated and was determined to be an issue with the import agent and not the Infoblox appliance itself.
11. nde.its.v.e DST fix – ND&E needed a manual DST fix applied to a legacy Red Hat box. The Olson time zone data was compiled by hand and the appropriate timezone data files were replaced.
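Going back to the LDAPB problem in item 5: since slapd keeps running but stops answering, a check that only looks for the process or an open port would miss it. Below is a rough sketch of the kind of proactive probe I have in mind, assuming some `check` callable that performs a real authentication round trip against the server (not shown here, and entirely hypothetical). The `probe` wrapper just puts a hard time limit on it so a hung daemon shows up as a timeout rather than hanging the monitor.

```python
import concurrent.futures


def probe(check, timeout_s=5.0):
    """Run check() with a time limit; return "OK", "FAILED", or "TIMEOUT"."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return "OK" if future.result(timeout=timeout_s) else "FAILED"
    except concurrent.futures.TimeoutError:
        return "TIMEOUT"  # the check hung, e.g. the daemon stopped answering
    except Exception:
        return "FAILED"  # the check raised, e.g. the bind was refused
    finally:
        # Do not block on a hung check; let the worker finish in the background.
        executor.shutdown(wait=False)
```

Anything other than "OK" would be the cue to alert someone (or restart slapd) before the community notices.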