Well, first of all, on many chip architectures an integer division of a 32-bit number takes 32 cycles, while most other integer instructions take only 1. Also, you can unroll the loop quite easily, since you know exactly how large the number can be.
And you can analyze modulo arithmetic on any base, to find a convenient relationship. So, for example, the number 17.
1 = 1 mod 17, and 1 = -16 mod 17
2 = 2 mod 17, and 2 = -15 mod 17
4 = 4 mod 17, and 4 = -13 mod 17
8 = 8 mod 17, and 8 = -9 mod 17
16 = 16 mod 17, and 16 = -1 mod 17
So, with any integer n represented by two integers i & j where n = i * 16 + j, and 0 <= j < 16, then if n = 0 mod 17, then i * 16 + j = 0 mod 17, so i * -1 + j = 0 mod 17, so if i - j is evenly divisible by 17, then n is evenly divisible by 17.
So, I choose 16 to be the base, which is even more convenient because base-16 (hexadecimal) is a convenient format to show numbers in. If you’re unfamiliar with base 16, every digit is multiplied by a power of 16 for every step left of the ones digit, and the letters a through 4 represent the numbers 10 through 15. for example, the number 482154 would be represented as 75b6a in hexadecimal, usually denoted as 0x75b6a, to make it clear that everything to the right of the “0x” is a hexadecimal number.
So, in hexidecimal, 7 + b + a = 1C, and 5 + 6 = B. And 1C - B = 11, which is 17 in base-16, so 482154 is evenly divisible by 17. That took 3 adds and 1 subtract, each of which only take one cycle to compute, compared to a divide operation that could take up to a cycle per bit.
Continuing the example, in the MIPS instruction set, with the original number in register r1, the code to check the divisibility of a number by 17 would look like this:
andi r2, r1, 0x0f0f
srl r3, r1, 16
andi r3, r3, 0x0f0f
addu r2, r2, r3
srl r3, r2, 8
addu r4, r2, r3
srl r2, r1, 4
andi r2, r2, 0x0f0f
srl r3, r1, 20
andi r3, r3, 0x0f0f
addu r2, r2, r3
srl r3, r2, 8
addu r2, r2, r3
subu r2, r2, r4
srl r3, r2, 7
andi r3, r3, 0x0001
subu r2, r2, r3
srl r3, r2, 4
xor r2, r2, r3
andi r2, r2, 0x000f
…while to do the equivalent using divide would be:
addi r2, r0, 0x0011
divu r1, r2
mfhi r2
While the first code takes 20 instructions to set r2 to 0 if r1 is evenly divisible by 17, it would take exactly 20 cycles to perform. The second code, while only 3 instructions to do the same thing, would take 34 cycles. So, a 40% speedup going from the division to the modular-math technique.
The discrepency on 64-bit architectures is even larger, since the divide would take twice as long, but the modular-math technique would only take a few more instructions.