Wednesday, September 13, 2006

Best practices, network outages and.....resolutions !!!

One of the first things I would advise any IT guy to do is to profile his network and system infrastructure. It is essential that you document each and every network-enabled device. It will assist you greatly when you are planning an upgrade to your company's IT infrastructure, when you want to take a system offline for maintenance, or, more importantly, when you have to quickly and effectively resolve an outage. Some of the key things to include would be as follows:
  • Architecture of the network.

  • Public IP addresses and their NAT mappings to devices within the network, if NAT is in use.

  • Number of Routers / Switches / Firewall devices / Modems / Access Points / VoIP phones.

  • OS and firmware versions running on each of these network devices.

  • Always back up ALL configurations to a network repository or to a DVD/CD.

  • Take snapshots of the configurations of critical resources like Firewall settings / AV Server settings / DHCP / Routers etc. It may be more than handy when you are configuring a newer device or rebuilding the device if the backup configuration is not working.

  • IP addresses and, more importantly, MAC addresses of all the network-enabled devices.

  • Network points/node numbers assigned to each user.

  • Remembering a series of usernames/passwords can be quite a task for anyone. Relying on the built-in password protection of Excel or Word is not recommended because it can be cracked. Instead, store the usernames/passwords in an Excel file and then encrypt that file with a key. You can use a tool like PGP for this.

  • Also, make sure that the employees in the company are aware of the threats, trends and best practices for safe computing. For starters, clearly communicate the IT policies and the Do's and Don'ts.
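To make the checklist above a bit more concrete, here is a minimal sketch of how such an inventory could be kept as a plain CSV file instead of a hand-edited spreadsheet. The `Device` fields, the function names and the sample values are all my own assumptions for illustration, so adapt them to whatever you actually track.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class Device:
    # Hypothetical columns; pick the ones that matter on your network.
    name: str
    kind: str       # router, switch, firewall, access point, laptop...
    ip: str
    mac: str
    firmware: str
    location: str   # network point / node number or desk


def write_inventory(devices, stream):
    """Write the device list as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(Device)])
    writer.writeheader()
    for device in devices:
        writer.writerow(asdict(device))


def read_inventory(stream):
    """Read the CSV back into Device objects."""
    return [Device(**row) for row in csv.DictReader(stream)]


if __name__ == "__main__":
    # Made-up sample entry using a documentation IP range.
    sample = [Device("core-rtr", "router", "192.0.2.1",
                     "00:11:22:33:44:55", "12.4", "rack 1")]
    buffer = io.StringIO()
    write_inventory(sample, buffer)
    print(buffer.getvalue())
```

A plain text format like this is easy to back up, diff and search, which is exactly what you want in the middle of an outage.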

Let me share an experience that happened a few days back in the office:

It was a day as usual at the office - corporate mails, checking router health, backup status on the drive, server health, MRTG graphs, helpdesk stats.... Since ours is an R & D company, a lot of devices are in and out of the LAN frequently.

Around afternoon I got a call from a couple of users complaining that they were not able to access the Internet. Our web-based helpdesk is located in a Co-Lo and it was inaccessible as well....hmm... getting ready for another day in paradise. First things first, I telnetted to the router to see if the Internet link was down. There were no problems with the link and it was working fine. Now the slightly scary situation.....I logged into our Unified Threat Management system to check if there were any issues with it, and was hesitantly scrolling down the list when all of a sudden I saw that the number of sessions had shot up to around 1000 and was still increasing! Under normal circumstances, the number of sessions hovered around 200, so this was roughly five times the usual load, and the maximum supported by the device was 2000.

What could have caused this? Intrusion? Virus attack? Switch poisoning? My heart was pounding as I was going through pages and pages of logs, and at last I got hold of the culprit. It was another DHCP server on the LAN! A rogue DHCP server, you can call it.
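The check I did by eye in the logs can be sketched in a few lines. This is only an illustration under assumptions: it presumes you already have a list of (server IP, server MAC) pairs pulled from DHCP offers seen on the LAN, however you collect them, and a set of the DHCP servers you actually run. The function name and the sample addresses are made up.

```python
def find_rogue_dhcp(offers, authorized_ips):
    """Return {mac: set_of_ips} for DHCP servers that are not authorized.

    offers         -- iterable of (server_ip, server_mac) pairs (assumed input)
    authorized_ips -- set of IP addresses of the legitimate DHCP servers
    """
    rogues = {}
    for server_ip, server_mac in offers:
        if server_ip not in authorized_ips:
            # Key on the MAC address, since that is what identifies the box.
            rogues.setdefault(server_mac.lower(), set()).add(server_ip)
    return rogues


if __name__ == "__main__":
    seen = [
        ("10.0.0.1", "AA:BB:CC:00:00:01"),   # the real server
        ("10.0.0.99", "00:0B:5D:12:34:56"),  # something else answering
    ]
    for mac, ips in find_rogue_dhcp(seen, {"10.0.0.1"}).items():
        print("Unexpected DHCP server", mac, "offering from", sorted(ips))
```

Anything this returns is a machine handing out leases that you did not set up, and the MAC address is your lead for tracking it down.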

It was not over yet, as the log tracks only the MAC address and has no inbuilt functionality to capture the device name.

The next thing I checked was the Excel file where I had stored the MAC addresses of all the network-enabled devices in the company, collected using an open source utility called Angry IP Scanner. The MAC address matched one of the laptops. It was a Fujitsu laptop, but it was used for testing, hmmm...kind of a floating laptop. I wasn't sure who was using it, and the thought of running up and down three floors was lingering in my mind. I checked with the QA team and the Hardware team, which led me to the Apps team. There was our user, blissfully ignorant of the issue he had created.
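The lookup itself is trivial once the inventory exists, but logs and spreadsheets rarely write MAC addresses the same way, so it helps to normalize them before comparing. Here is a minimal sketch, assuming the inventory rows are dictionaries with a `mac` key; the helper names and the sample address are hypothetical.

```python
import re


def normalize_mac(mac):
    """Turn 00-0B-5D-12-34-56, 000b.5d12.3456 etc. into 00:0b:5d:12:34:56."""
    digits = re.sub(r"[:\-.\s]", "", mac).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def build_index(rows):
    """Index inventory rows by normalized MAC address."""
    return {normalize_mac(row["mac"]): row for row in rows}


def lookup(index, mac):
    """Return the inventory row for a MAC address, or None if unknown."""
    return index.get(normalize_mac(mac))


if __name__ == "__main__":
    inventory = [{"mac": "00-0B-5D-12-34-56", "name": "test laptop", "team": "floating"}]
    index = build_index(inventory)
    print(lookup(index, "000b.5d12.3456"))
```

An unknown MAC coming back as None is useful information too: it means a device is on the LAN that never made it into the inventory.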

He was testing an app called WinProxy, which apart from acting as a proxy also functions as a DHCP server. What amazes me is the ignorance of users when it comes to reading the Readme file or the Do's and Don'ts of using an application. I did give my piece of advice to our dude, and henceforth any testing of new equipment on our LAN will be done on an isolated network.

This is what prompted me to jot down the points that helped me resolve this issue in under 15 minutes. Hope you find it useful.