A major challenge in PDE software is balancing user-level flexibility against performance on heterogeneous hardware. We discuss how this challenge can be tackled, using the DUNE framework, and in particular its linear algebra and solver components, as an example. We demonstrate how the previously MPI-only implementation is modified to support MPI+[CPU/GPU] threading and vectorisation. To this end, we devise a novel block extension of the recently proposed SELL-C-SIGMA format. The efficiency of our approach is underlined by benchmark computations that exhibit reasonable speedups over the CPU-MPI-only case.
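
For readers unfamiliar with the storage scheme named above, the following is a minimal sketch of the plain (scalar) SELL-C-SIGMA layout and its matrix-vector product. It is not the block extension devised in this work and not DUNE code. The struct and function names (`Csr`, `SellCSigma`, `toSellCSigma`, `multiply`) are hypothetical and chosen for illustration only. The sketch assumes a CSR input with at least one column, sorts rows by descending length within windows of `sigma` rows, and pads each chunk of `C` rows to the length of its longest row.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical minimal CSR container used as input.
struct Csr {
  std::size_t rows;
  std::vector<std::size_t> rowPtr;  // size rows + 1
  std::vector<std::size_t> col;     // column index per nonzero
  std::vector<double> val;          // value per nonzero
};

// Hypothetical SELL-C-SIGMA container (scalar entries, no blocking).
struct SellCSigma {
  std::size_t rows;
  std::size_t C;                      // chunk height (number of SIMD lanes)
  std::vector<std::size_t> chunkPtr;  // start offset of each chunk
  std::vector<std::size_t> chunkLen;  // padded row length of each chunk
  std::vector<std::size_t> perm;      // perm[slot] = original row index
  std::vector<std::size_t> col;       // column indices, column-major per chunk
  std::vector<double> val;            // values, column-major per chunk
};

SellCSigma toSellCSigma(const Csr& A, std::size_t C, std::size_t sigma) {
  SellCSigma S{};
  S.rows = A.rows;
  S.C = C;
  S.perm.resize(A.rows);
  std::iota(S.perm.begin(), S.perm.end(), std::size_t{0});
  auto len = [&A](std::size_t r) { return A.rowPtr[r + 1] - A.rowPtr[r]; };

  // Sort rows by descending length inside each window of sigma rows.
  for (std::size_t begin = 0; begin < A.rows; begin += sigma) {
    const std::size_t end = std::min(begin + sigma, A.rows);
    std::stable_sort(S.perm.begin() + static_cast<std::ptrdiff_t>(begin),
                     S.perm.begin() + static_cast<std::ptrdiff_t>(end),
                     [&len](std::size_t a, std::size_t b) { return len(a) > len(b); });
  }

  // Pack chunks of C rows; padding entries keep value 0 and column 0.
  const std::size_t nChunks = (A.rows + C - 1) / C;
  S.chunkPtr.push_back(0);
  for (std::size_t k = 0; k < nChunks; ++k) {
    std::size_t width = 0;
    for (std::size_t l = 0; l < C && k * C + l < A.rows; ++l)
      width = std::max(width, len(S.perm[k * C + l]));
    const std::size_t base = S.val.size();
    S.val.resize(base + width * C, 0.0);
    S.col.resize(base + width * C, 0);
    for (std::size_t l = 0; l < C && k * C + l < A.rows; ++l) {
      const std::size_t r = S.perm[k * C + l];
      for (std::size_t j = 0; j < len(r); ++j) {
        S.val[base + j * C + l] = A.val[A.rowPtr[r] + j];
        S.col[base + j * C + l] = A.col[A.rowPtr[r] + j];
      }
    }
    S.chunkLen.push_back(width);
    S.chunkPtr.push_back(S.val.size());
  }
  return S;
}

// y = A * x; the innermost loop over l runs across the C rows of a chunk
// and touches contiguous memory, which is what makes it vectorisable.
std::vector<double> multiply(const SellCSigma& S, const std::vector<double>& x) {
  std::vector<double> y(S.rows, 0.0);
  for (std::size_t k = 0; k + 1 < S.chunkPtr.size(); ++k)
    for (std::size_t j = 0; j < S.chunkLen[k]; ++j)
      for (std::size_t l = 0; l < S.C; ++l) {
        const std::size_t slot = k * S.C + l;
        if (slot >= S.rows) continue;  // last chunk may be incomplete
        const std::size_t idx = S.chunkPtr[k] + j * S.C + l;
        y[S.perm[slot]] += S.val[idx] * x[S.col[idx]];
      }
  return y;
}
```

In this sketch, a 4x4 matrix with row lengths 2, 1, 3, 1 (7 nonzeros) occupies 8 stored entries with `C = 2` and `sigma = 4`, but 10 entries with `sigma = 1` (no sorting). This illustrates how the sorting window reduces padding at the cost of a row permutation.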