Autonomous evaluation rollouts from three finetuned co-trained LBMs performing long-horizon and dexterous tasks