Side Project: A Distributed Test Runner
I’m working with a customer who, for historical reasons, has a test lab with 20-something test machines of various speed and capacity and automated through code they’ve written themselves over the years. As part of the regular build process they distribute all their automated functional tests across these machines and a test run typically takes somewhere between 2.5 to 3 hours. They want this to be as fast as possible, so to figure out the fastest way they can complete a test run they have been pre-allocating tests to machines based on historical run times. They also have some tests that can only be run by some machines because of how the machines are configured, and they tie these tests to machines via a test category attribute.
In theory, this approach seems OK; you look at the previous execution times to work out the best distribution for the next test run, but in practice it doesn’t work out like that. Machines aren’t all the same capacity so test times can vary significantly between one machine and another, and there are new tests with no history being added all the time. The end result is some machines in the lab end up being idle for around 30 minutes, meaning test runs take longer than they should.
Now the obvious solution here is to move away from all the pre-allocated, predictive approach to distributing the tests and instead simply put all the tests in a queue. Test machines then grab the next available test from the queue and execute it. This way all machines will be busy up until the point the queue is empty and the final tests are awaiting completion by other machines in the lab.
Microsoft does exactly this with TFS. Machines in a test lab have a test agent installed on them and those agents communicate with a test controller that holds a queue of tests to be executed and farms them out to the agents based on certain criteria.
Borrowing from that concept, I decided to produce a proof of concept that my customer could borrow from and incorporate into their hand coded test lab environment.
That code is now available at https://github.com/rbanks54/DistributedTestRunner and I thought I’d share it in case you were interested.
The architecture is pretty simple.
1. Test Controller
The test controller is a set of REST endpoints (built in ASP.NET Web API) and a rudimentary UI that shows the status of a test run. The controller is a console app, running as a self-hosted OWIN server. No need for IIS here.
To start a test run the end user provides a path to an MSTest based assembly either via the API or the UI. The assembly is then parsed and all the tests placed in queues based on the category attributes, waiting for agents to start requesting tests to run.
2. Test Agents
An agent simply polls the controller for a test to run. Mulitple agents can run at once.
The controller will determine what test is next from the queue and return it to the agent. The agent then kicks off MSTest, passing the test name in as an argument, gathers the test results and sends a success/fail status back to the controller to indicate the test has completed, before then asking for another test to run.
Additionally, since we’re spinning up an instance of MSTest for each test we execute (I know, it’s not efficient) we do a little extra work to merge the individual test result files from each MSTest run into a single TRX file so that when all tests for a test run are completed, we can see the results in a single file.
It all works pretty well. Not bad for a small amount of effort!
So, feel free to have a look at the code if you like, and borrow from it as you will. If you like to experiment, feel free to take the code and extend it to make it more interesting and useful. It’s open source! I’d love to see what you do with it!
Just remember it’s a proof of concept at the moment. I made an assumption that the test assembly is in the same folder as the test agent/controller, I didn’t secure the API calls. I didn’t write unit tests (I know; practice what I preach, right?). I don’t send the TRX files back to the controller after the test run completes. I could’ve used SignalR for polling instead of the simple timer loop I used. These are all things you could improve on if you wanted to try your hand at something.
Personally, I found it fun and interesting to go through the process of putting it all together and then walking through the customer through how it works and the approaches I took. Maybe you’ll find it useful too. If not, don’t worry. It was fun to write and, after all, isn’t that why we do the job we do?
Happy coding!