Connecting “Impossibly” Disconnected Data – Full Story

WSM International had won a contract to migrate an entire state’s technology infrastructure to the cloud, and the project was drowning in its own data.

We had nearly a dozen large datasets tracking various aspects of the migration: computer inventories from different departments, network configurations by building, server migration status, service dependencies, backup schedules, licensing information, physical locations, network circuits, user assignments, and vendor contacts.

Each dataset lived in its own spreadsheet or database, maintained by different teams, with no common identifier linking them together. Project managers were spending 10+ minutes just to answer simple questions like “If we migrate this server next week, which computers will be affected and what network changes do we need?”

The datasets were fundamentally incomplete: a computer might be identified only by a hostname in one outdated sheet provided by the client, by IP address in another, by its official DNS FQDN in a third, and by an old NetBIOS name in yet another. There was no master primary key tying these identifiers together. My team was being pulled away from actual migration work to spend hours manually cross-referencing information, and senior technicians were burning valuable project time on data archaeology instead of technical work that matched their expertise and billing rate. The CTO had looked at the data himself and concluded, “We didn’t think there was anything we could do to bring this all together. I looked over all of the data myself and my only thought was to use Python, but that wouldn’t work because then it is local to my machine.”

I spent 7 to 10 hours studying the data and building a small database-style search engine in Excel, using nested VLOOKUPs, INDEX/MATCH combinations, and conditional formulas to create relationships between the disparate datasets. If I could find a building name in one dataset and a department name in another, and both appeared in a third dataset alongside IP address ranges, I could create a chain of lookups that connected computer hostnames to network configurations to server dependencies.
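A minimal sketch of the chaining idea, using hypothetical sheet and column layouts (the actual sheet names and ranges from the project are not reproduced here): assume `Buildings` maps hostnames (column A) to building names (column B), and `Network` maps building names (column A) to subnet details (column C). Given a hostname in `A2`, the inner lookup resolves its building, and the outer lookup uses that building to pull the network configuration:

```
=INDEX(Network!$C:$C,
  MATCH(
    INDEX(Buildings!$B:$B, MATCH(A2, Buildings!$A:$A, 0)),
    Network!$A:$A, 0))
```

Each additional dataset in the chain adds another INDEX/MATCH layer, which is how the formulas grew so deeply nested.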

Some of the formulas nested 13 to 15 functions, essentially creating a relational database structure within Excel. The key was finding the overlapping data points – the places where two datasets shared even one piece of information – and exploiting those connections to bridge the gaps.
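With no master key, a record often had to be matched on whichever identifier happened to exist for it. A hedged sketch of that fallback pattern, with hypothetical sheet names and column positions: try the hostname first, then the DNS FQDN, then the old NetBIOS name, and only report a miss if all three fail:

```
=IFERROR(VLOOKUP($A2, Inventory!$A:$F, 4, FALSE),
 IFERROR(VLOOKUP($B2, DNSExport!$A:$E, 3, FALSE),
 IFERROR(VLOOKUP($C2, NetBIOSList!$A:$D, 2, FALSE),
 "No match")))
```

Stacking several of these fallback chains inside a single conditional formula is what pushed some cells to 13–15 functions.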

Once the approach tested cleanly against my sample data, I imported the workbook into Smartsheet, and apart from one minor syntax error that Smartsheet flagged, it worked flawlessly. Data lookup time dropped from roughly 10 minutes to about 30 seconds – a 95% reduction. Project managers could now answer stakeholder questions in real time during meetings rather than promising to “get back to you after we research that.”

When I presented the completed search functionality to the CTO, he was visibly shocked. His exact words to me were: “We didn’t even bother trying to connect the data. After seeing how disconnected everything was, we didn’t think there was anything we could do to bring this all together. So serious thank you, this is incredible.”

My formulas connected the following data in a simplified, unified Smartsheet environment:

  • Hostname
  • IP
  • Open ports
  • Network details (see below)
  • Migration Status
  • Migration status of dependent devices / services
  • Critical Software
  • Related Departments
  • NIC Config (Static IP, DNS, Gateway, route)
  • Configuration mode for multi-NIC devices
  • VLAN assignments
  • And in some cases also:
    • Security Group Rules
    • NACLs
    • Network Routes
    • Firewall Rules

The workbook became the project’s central source of truth, and for the first time in the contract, everyone had a complete view of how all the pieces fit together.