Operating in Areas Contested by Terrorists:
eHealth is the prime contractor for the Gates Foundation in Northern Nigeria. There is a lot of technology involved in the fight against Polio, but the challenges with this work were not merely technical: the region was plagued by terrorism.
eHealth’s director was upfront with me: there’s no security here, nor would he purport to offer any guarantees of it. Before Skyping with him, I did some Googling and found a YouTube video (since removed) showing women wailing in a courtyard filled with bodies. These were the National Police massacred by Boko Haram at their Kano HQ, located about 5 minutes from where I’d be living. When I made my way to the flight, I did so with eyes wide open as to the challenging security situation.
Shortly after I arrived in Kano, Boko Haram murdered 30 people in two car bombings not far from where I was. With the risk no longer theoretical or distant, one of the eHealth staff flew back to Atlanta. And it wasn’t long before I learned Boko Haram were sawing people’s heads off with tree saws.
And when you’re the only white person around for a gazillion miles, working 100 feet up a mast, it’s pretty difficult to be discreet. And of course falling from the mast is not impossible: step on a part with a brittle weld and you can be in freefall if you don’t have a sufficient grip on the laterals.
When planning and executing work, I never asked my Nigerian colleagues to go anywhere I would not go, nor to do anything I myself would not do. I always placed their safety above my own. I was acutely aware that, were we caught, whatever happened to me, the fate of my colleagues, being Muslims in the company of a white Infidel, would be much more grim than my own.
Having said all that, the Northern Nigerians are just the most decent, moral and kind folks one could ever hope to meet and I felt privileged to work and live amongst them.
Case 1: Zabbix Monitoring Saved Infrastructure from Total Loss
Abuja, being the capital of Nigeria, had modern infrastructure, so all of eHealth Nigeria’s servers lived down there, where users wrote data and accessed applications across the large network. There was also less risk of loss due to the effects of terrorism; Abuja was largely secure compared to the North. But one thing that is problematic everywhere in Nigeria is reliable power. Even in Abuja the infrastructure was reliant upon generators.
I asked the EOC manager there what happened if the backup generator failed, to which they responded that it’d never happened and was not a possibility worth worrying about. However, a good infrastructure person is by their nature a pessimist and always plans for the worst-case scenario…
Notwithstanding what I was told, I proceeded to configure monitoring on the servers’ IPMI interfaces, in addition to other monitoring, in Zabbix.
A few months later, while I was in Kaduna, a flood of alerts hit my phone: all the servers were HOT.
A call to the security guard at the Abuja EOC revealed that BOTH the primary AND the backup generators had failed concurrently, just 10-15 minutes after all the staff left for the evening; it was a perfect storm.
The cooling, running on mains electricity, was DOWN, but all the servers connected to UPS were still UP and puking heat. I instructed the guard to immediately open all the doors and windows and vent the heat.
Had I not configured the monitoring on the servers’ IPMI interfaces, disaster would have been assured. I view that decision to do hardware monitoring on the key assets as one of the most important things I did for eHealth.
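To illustrate the reasoning behind that flood of alerts, here is a minimal Python sketch. It is not the Zabbix configuration I used. The `Reading` type, the function names and the 35 °C threshold are all assumptions made up for the example. The point it shows is simple: one hot server suggests a fault in that box, but every server running hot at once points at the room, i.e. the cooling or the power feeding it.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One temperature sample, e.g. as read from a server's IPMI sensors."""
    host: str
    sensor: str
    celsius: float


def hot_hosts(readings, warn_at=35.0):
    """Return sorted host names with at least one sensor at or above warn_at.

    warn_at is an illustrative threshold, not a vendor recommendation.
    """
    return sorted({r.host for r in readings if r.celsius >= warn_at})


def all_hosts_hot(readings, warn_at=35.0):
    """True when every host that reported has at least one hot sensor.

    That pattern suggests a room-level problem rather than a single
    failing server.
    """
    hosts = {r.host for r in readings}
    return bool(hosts) and set(hot_hosts(readings, warn_at)) == hosts
```

In practice the readings would come from whatever polls the IPMI interfaces; the sketch only covers the decision made once the numbers are in hand.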
Case 2: Using Zabbix to Isolate Faults on a Large Complex Network
The eHealth Network spanned the Northern half of Nigeria; it was a large and complex network connecting Emergency Operations Centers in remote parts of the country. Extreme weather (heat, wind and storms) could affect the connectivity. By planning the monitoring, I was able to identify reliability problems and broken connectivity quickly and restore network services to hundreds of users scattered across the North. This would not have been possible without extensive prior experience configuring enterprise monitoring solutions such as Zabbix.
Pings and traceroutes, however inefficient they might be for the job, will only prove paths through a network: they will NOT identify reliability issues, because no historical data is captured. Zabbix accumulated this data, which made it easy to identify reliability issues and correlate them to events.
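To make the difference concrete, here is a minimal Python sketch of what accumulated history buys you. It does not use Zabbix or its API. The function names, the hourly window and the 5% loss threshold are assumptions for illustration only. Given timestamped probe results for one link, it computes the loss per hour and lists the hours in which the link was unreliable, which can then be lined up against events such as a storm.

```python
from collections import defaultdict
from datetime import datetime


def loss_by_hour(samples):
    """Group probe results by hour and return {hour_start: loss_fraction}.

    samples is an iterable of (datetime, replied) pairs, where replied is
    True when the probe was answered.
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for ts, replied in samples:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        sent[hour] += 1
        if not replied:
            lost[hour] += 1
    return {hour: lost[hour] / sent[hour] for hour in sorted(sent)}


def unreliable_hours(samples, max_loss=0.05):
    """Return the hours whose loss exceeds max_loss (an assumed threshold)."""
    return [hour for hour, loss in loss_by_hour(samples).items() if loss > max_loss]
```

A single ping run at the moment someone complains sees only the present; a list like the one `unreliable_hours` returns shows when the link degraded and for how long.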
Zabbix has come on in leaps and bounds in the 9 years since it saved my bacon in Africa, and it is an indispensable tool for operating large, complex IT infrastructures.
Before sending an EOC Tech out to their assignment, I’d give them practical training in the Kano EOC with my excellent colleague Mukhtar. I’d go through all the key skills they needed, and we’d give the trainee hands-on practice. A large part of the training was of course networking, as well as working with a 1U local server and printers.
Mukhtar and I would introduce faults into the network and observe how the trainees worked to isolate the cause of each fault. We would increase the complexity of the faults, and introduce multiple faults for the trainee to work through. By the time we were through, they had a high level of self-confidence that they could succeed, and we had confidence they would succeed too. BTW, the young lady in the picture below was not hired because of gender quotas: she was hired because she was by far the sharpest candidate out of a fairly large field. She earned her appointment; nothing was given to her. Although not a network guru, she was previously a software developer and so had substantial IT experience and possessed excellent analytical and logical skills.
And while I was training the EOC trainees with Mukhtar, I was also training him to be a trainer, so he could carry on after I returned to the UK.
Hardware & Drivers
One of the most serious (and pressing) problems I faced was the Polio Programme’s DB server tipping over at random.
The logs offered no clues, but we noticed that the problem would manifest when a new vaccination drive kicked off.
It was a show-stopper of a problem threatening the Programme, and even the developers in Geneva who configured the server had no idea how things were breaking.
My troubleshooting isolated the fault to the BIOS and its “Green” powersaving settings. When activity dropped, the box would spin down its various interfaces to reduce power consumption. That part of the “Green” power settings worked correctly.
HOWEVER: when activity picked up again, badly written hardware drivers didn’t play well with the OS and the DB server would tip over.
I disabled the “Green” powersaving settings in the BIOS, and from then on the DB server was performant and reliable.