More performance optimization.
div_floor_mod() generates inefficient code. When the divisor is a power of 2, the division and modulo can be replaced with a shift and a mask.
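
A minimal sketch of the shift-and-mask idea in C, for illustration only. The real signature of div_floor_mod() is not shown here, so the result type `floor_divmod_t` and the helpers `div_floor_mod_generic()` and `div_floor_mod_pow2()` are hypothetical names, and the divisor is assumed to be passed as an exponent `k` (divisor = 2^k). Right-shifting a negative signed value is implementation-defined in C, so the sketch shifts the complement instead.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical result type; the actual div_floor_mod() interface may differ. */
typedef struct {
    int quot; /* floor(x / d) */
    int rem;  /* x - quot * d, always in [0, d) */
} floor_divmod_t;

/* Generic version for comparison: C's / and % truncate toward zero,
 * so a negative remainder has to be corrected after the fact. */
static floor_divmod_t div_floor_mod_generic(int x, int d)
{
    floor_divmod_t r;
    assert(d > 0);
    r.quot = x / d;
    r.rem = x % d;
    if (r.rem < 0) {
        r.rem += d;
        r.quot -= 1;
    }
    return r;
}

/* Power-of-2 version: the divisor is 2^k, so no division is needed. */
static floor_divmod_t div_floor_mod_pow2(int x, unsigned k)
{
    floor_divmod_t r;
    unsigned mask;
    assert(k < sizeof(int) * CHAR_BIT - 1); /* 2^k must fit in an int */
    mask = (1u << k) - 1u;
    /* For negative x, ~x is non-negative, so the shift is well defined;
     * complementing again yields floor(x / 2^k). */
    r.quot = (x >= 0) ? (x >> k) : ~(~x >> k);
    /* Conversion to unsigned is modular, so the low k bits are the
     * non-negative remainder even when x is negative. */
    r.rem = (int)((unsigned)x & mask);
    return r;
}
```

For example, div_floor_mod_pow2(-5, 2) gives quot = -2 and rem = 3, the same as div_floor_mod_generic(-5, 4), without a divide instruction or a correction branch.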