One thing though: if only one app shows a regression, it might be that that particular app is doing something wrong. If a number of apps regress on the same kernel, then it may well be a kernel regression.
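A rough sketch of that rule of thumb in Python, in case it helps. Everything here is an assumption for illustration: the function name, the 5% threshold, the cutoff of three apps, and the idea that scores are "higher is better". It is not anything PTS provides.

```python
def classify_regression(results, threshold=0.05, min_apps=3):
    """Guess whether a slowdown looks app-specific or kernel-wide.

    results maps an app name to (baseline_score, new_score), where a
    higher score is assumed to be better. threshold and min_apps are
    arbitrary illustrative cutoffs, not established values.
    """
    # Apps whose score dropped by more than `threshold` relative to baseline.
    regressed = sorted(
        app
        for app, (old, new) in results.items()
        if old > 0 and (old - new) / old > threshold
    )
    if not regressed:
        verdict = "no regression"
    elif len(regressed) >= min_apps:
        verdict = "likely kernel regression"
    else:
        verdict = "possibly app-specific"
    return verdict, regressed
```

With made-up numbers, `classify_regression({"app_a": (100, 99), "app_b": (100, 80)})` flags only `app_b`, so the verdict is "possibly app-specific"; once three or more apps drop on the same kernel, it switches to "likely kernel regression".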
As benchmark time is limited, I would use as many PTS benchmarks as possible, but not run each one for long. Instead of 3-minute runs for five applications, one could use 30-second runs for 30 applications; both would be 900 seconds of benchmarking.
Is this feasible?
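To make the time-budget arithmetic concrete, here is a minimal sketch. The helper name and the fixed 900-second budget are just assumptions for the example, and it ignores setup and warm-up time between runs.

```python
def per_run_seconds(total_budget_s, n_benchmarks):
    """Split a fixed time budget evenly across n_benchmarks runs."""
    if n_benchmarks <= 0:
        raise ValueError("n_benchmarks must be positive")
    return total_budget_s / n_benchmarks


budget_s = 900  # assumed total benchmark time in seconds
for n in (5, 30):
    print(f"{n} benchmarks -> {per_run_seconds(budget_s, n):.0f} s each")
```

So the same 900 seconds buys either five runs of 180 seconds (three minutes) or 30 runs of 30 seconds. The trade-off is wider coverage against noisier individual results.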