We report on our experiences with integrating GPUs as fast, parallel floating-point co-processors into the parallel finite element (FE) package FEAST. Since a full re-implementation of such a package is not feasible, we identify the smoothing step of an outer domain-decomposition multigrid solver as a natural entry point for a minimally invasive integration of GPUs. We address the limited computational precision of GPUs with a mixed-precision iterative refinement approach and present preliminary timing results on a commodity cluster enhanced with GPUs.
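
The idea behind mixed precision iterative refinement can be illustrated with a minimal sketch. This is not FEAST code. The helper names (`to_f32`, `jacobi_f32`, `residual`, `mixed_precision_refinement`), the choice of Jacobi sweeps as the low-precision inner solver, and the small test system are all assumptions made for illustration. Single precision is emulated on the CPU by rounding every intermediate value to 32 bits, which stands in for the co-processor. Residuals and solution updates are kept in double precision.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def jacobi_f32(A, r, sweeps):
    """Hypothetical low-precision inner solver: Jacobi sweeps on A c = r.

    Every arithmetic result is rounded to float32 to emulate a
    single-precision co-processor.
    """
    n = len(r)
    A32 = [[to_f32(a) for a in row] for row in A]
    r32 = [to_f32(v) for v in r]
    c = [0.0] * n
    for _ in range(sweeps):
        new_c = []
        for i in range(n):
            s = r32[i]
            for j in range(n):
                if j != i:
                    s = to_f32(s - to_f32(A32[i][j] * c[j]))
            new_c.append(to_f32(s / A32[i][i]))
        c = new_c
    return c


def residual(A, x, b):
    """Residual b - A x, evaluated in double precision."""
    n = len(b)
    return [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]


def mixed_precision_refinement(A, b, tol=1e-12, max_outer=50, sweeps=20):
    """Outer double-precision loop around the single-precision inner solver.

    Returns the solution and the number of outer iterations performed.
    """
    x = [0.0] * len(b)
    for k in range(max_outer):
        r = residual(A, x, b)                # high precision residual
        if max(abs(v) for v in r) <= tol:
            return x, k
        c = jacobi_f32(A, r, sweeps)         # low precision correction
        x = [xi + ci for xi, ci in zip(x, c)]  # high precision update
    return x, max_outer


if __name__ == "__main__":
    # Small diagonally dominant test system (an assumption for this sketch).
    A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
    b = [1.0, 2.0, 3.0]
    x, outer = mixed_precision_refinement(A, b)
    print("solution:", x)
    print("outer iterations:", outer)
    print("max residual:", max(abs(v) for v in residual(A, x, b)))
```

The design point is that only the correction is computed in low precision. Because each outer step recomputes the residual in double precision, rounding errors made by the inner solver are corrected in later steps rather than accumulated in the final solution.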