Fibonacci calculation at peak performance
Idea
This article explores Fibonacci sequence algorithms in Elixir, the optimizations that can be applied to them, and how native code can help us compute even faster. The main objective is to compute the largest Fibonacci number as fast as possible.
Start with the Math!
The most powerful algorithms out there rely on a quite simple mathematical principle.
$$ F_n = F_{n-1} + F_{n-2} $$
This means that if you know the two previous terms of the series, you can compute the next one: starting from $F_0 = 0$ and $F_1 = 1$, the series goes $0, 1, 1, 2, 3, 5, 8, 13, \dots$ The Fibonacci series is recursive because the series itself appears in its own definition.
Math is not real life.
Based on the math above, we can come up with this algorithm:
def fib(0), do: 0
def fib(1), do: 1
# Each call spawns two more calls: the subtree below n - 2 is computed twice.
def fib(n) do
  fib(n - 1) + fib(n - 2)
end
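If you want to feel the blow-up yourself, :timer.tc/1 is enough. A minimal sketch; the module name FibNaive is illustrative, not from the article's repo:

# :timer.tc/1 returns {microseconds, result}.
{micros, result} = :timer.tc(fn -> FibNaive.fib(35) end)
# result == 9227465; micros roughly multiplies by the golden ratio for each +1 on n.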
Let me tell you why this is a terrible idea. This algorithm is extremely slow because it redoes computations that have already been done and never stores any result.
flowchart TB;
root["$$F_{n}$$"]
f1["$$F_{n-1}$$"]
f2["$$F_{n-2}$$"]
f11["$$F_{n-2}$$"]
f12["$$F_{n-3}$$"]
f21["$$F_{n-3}$$"]
f22["$$F_{n-4}$$"]
root-->f1;
root-->f2;
f1-->f11;
f1-->f12;
f2-->f21;
f2-->f22;
As you can see from this example, the complexity of this algorithm grows exponentially, $O(2^n)$. Notice that we perform the same computations, such as $F_{n-2}$ or $F_{n-3}$, multiple times.
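More precisely, the number of calls the naive version makes satisfies its own Fibonacci-like recurrence,

$$ C(n) = C(n-1) + C(n-2) + 1, \qquad C(0) = C(1) = 1, $$

which grows as $O(\phi^n) \approx O(1.618^n)$, loosely bounded above by $O(2^n)$.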
Improving the naive solution.
One improvement we can make is to apply dynamic programming: by memoizing results, the complexity drops dramatically.
def fib(n) do
  # Create the cache table on first use.
  if :ets.whereis(:fibs) == :undefined do
    :ets.new(:fibs, [:named_table, read_concurrency: true])
  end

  do_fib(n)
end

defp do_fib(0) do
  :ets.insert(:fibs, {0, 0})
  0
end

defp do_fib(1) do
  :ets.insert(:fibs, {1, 1})
  1
end

defp do_fib(n) do
  case :ets.lookup(:fibs, n) do
    # Cache hit: return the stored value.
    [{^n, val}] ->
      val

    # Cache miss: compute, store, return.
    [] ->
      val = do_fib(n - 1) + do_fib(n - 2)
      :ets.insert(:fibs, {n, val})
      val
  end
end
This is the same algorithm as before, but we cache results in an ETS table, so an already computed value never needs to be recomputed.
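A quick sanity check in iex (the module name FibMemo is illustrative); the lookup shows the value sitting in ETS after the first call:

iex> FibMemo.fib(45)
1134903170
iex> :ets.lookup(:fibs, 45)
[{45, 1134903170}]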
We can run a simple benchmark comparing the memoized implementation against the naive one to demonstrate how powerful this technique is.
##### With input 1 #####
Name ips average deviation median 99th %
Elixir O(2^n) Algorithm 16.64 M 60.09 ns ±42065.44% 45 ns 69 ns
Elixir O(2^n) Algorithm Memo 1.16 M 864.46 ns ±5353.57% 792 ns 1150 ns
Comparison:
Elixir O(2^n) Algorithm 16.64 M
Elixir O(2^n) Algorithm Memo 1.16 M - 14.39x slower +804.37 ns
##### With input 20 #####
Name ips average deviation median 99th %
Elixir O(2^n) Algorithm Memo 1.51 M 0.66 μs ±9240.96% 0.54 μs 0.88 μs
Elixir O(2^n) Algorithm 0.0102 M 98.09 μs ±55.81% 96.94 μs 105.33 μs
Comparison:
Elixir O(2^n) Algorithm Memo 1.51 M
Elixir O(2^n) Algorithm 0.0102 M - 148.28x slower +97.43 μs
##### With input 45 #####
Name ips average deviation median 99th %
Elixir O(2^n) Algorithm Memo 1.48 M 0.00000 s ±8725.95% 0.00000 s 0.00000 s
Elixir O(2^n) Algorithm 0.00000 M 16.21 s ±0.13% 16.21 s 16.22 s
Comparison:
Elixir O(2^n) Algorithm Memo 1.48 M
Elixir O(2^n) Algorithm 0.00000 M - 24065479.81x slower +16.21 s
---
config:
themeVariables:
xyChart:
plotColorPalette: '#D278AA, #7C6D91'
---
xychart
title "Benchmark Memo vs No Memo"
x-axis "Algorithm" [ "O(2^N) input 45", "Memo input 45", "O(2^N) input 20", "Memo input 20" ]
y-axis "Time in microseconds" 1 --> 200
bar [1622000, -1000000, 105.33, -1000000 ]
bar [-1000000, 1.150, -1000000, 0.88]
For each input we flush the memoization cache (see the after_scenario hook in the Benchee setup further down); leaving it populated would be unfair, since the algorithm would not compute every N at least once.
Sometimes improving is not enough.
The dynamic programming implementation is pretty fast, but it is not enough. We can get a better implementation if we rethink the naive approach and make it $O(n)$.
One trick is to carry the previous two numbers as parameters of the recursion; this way we do not need to recompute all the earlier results each time.
def fib(0), do: 0
def fib(1), do: 1
def fib(n), do: fib(n, 0, 1)
defp fib(1, _a, b), do: b
defp fib(n, a, b), do: fib(n - 1, b, a + b)
This implementation still uses the same mathematical principle, but it performs far fewer operations and uses far less storage, since it only needs to keep the two previous Fibonacci numbers.
flowchart LR;
root["$$F_{n}$$"]
f1["$$F_{n-1}$$"]
f2["$$F_{n-2}$$"]
f0["$$F_{0}$$"]
f01["$$F_{1}$$"]
root-->f1;
f1-->f2;
f2-->f01;
f01-->f0;
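The same linear idea can also be written as a fold, if you prefer that over explicit recursion. A sketch equivalent to the code above, not taken from the article's repo:

def fib_reduce(0), do: 0
def fib_reduce(n) do
  # Fold over 2..n carrying {F(k-1), F(k)}; the range is empty when n == 1.
  {_a, b} = Enum.reduce(2..n//1, {0, 1}, fn _k, {a, b} -> {b, a + b} end)
  b
end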
What about another math approach?
There is another mathematical approach: Binet's formula.
$$ F_n = \frac{\phi^{n} - (-\phi)^{-n}} {\sqrt{5}} $$
This formula allows us to calculate any number of the sequence just by knowing the value of the constant $\phi$ (the golden ratio).
import Bitwise

# The golden ratio, needed by Binet's formula.
@phi 1.618033988749895

def phi_formula(n),
  do: round((:math.pow(@phi, n) - :math.pow(-@phi, -n)) * fast_inverse_square_root(5.0))

defp fast_inverse_square_root(number) do
  x2 = number * 0.5
  # Reinterpret the float's bit pattern as a 64-bit integer.
  <<i::integer-size(64)>> = <<number::float-size(64)>>
  # 64-bit magic constant, then three Newton-Raphson refinement steps.
  <<y::float-size(64)>> = <<0x5FE6EB50C7B537A9 - (i >>> 1)::integer-size(64)>>
  y = y * (1.5 - x2 * y * y)
  y = y * (1.5 - x2 * y * y)
  y = y * (1.5 - x2 * y * y)
  y
end
This implementation is in fact $O(1)$ or $O(\log n)$, depending on the complexity of the `:math.pow` function. What is the tradeoff? This implementation is an approximation, so error accumulates as $n$ grows. Notice that $1/\sqrt{5}$ is calculated with the Quake III fast inverse square root algorithm; another alternative would be to hardcode this constant in the algorithm.
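For low inputs the approximation rounds to the exact value; assuming the implementation above, you should get exact Fibonacci numbers, e.g.:

iex> FibRustElixir.phi_formula(10)
55
iex> FibRustElixir.phi_formula(20)
6765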
The end?
So... Is this the end ?
NO, because Elixir runs on the Erlang VM, which probably adds some overhead to the calculation. We can try writing an Erlang NIF (Native Implemented Function) to see if it is faster than standard Elixir.
To do so, we can take a look at rustler, which is a fantastic crate/Elixir library.
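On the Elixir side, the NIF module boils down to something like this minimal sketch; the otp_app and crate names here are assumptions, not necessarily what the article's repo uses:

defmodule TurboFibonacci do
  # Rustler compiles and loads the crate, replacing these stubs at startup.
  # :turbo_fibonacci and "turbofibonacci" are illustrative names.
  use Rustler, otp_app: :turbo_fibonacci, crate: "turbofibonacci"

  # Fallbacks only raise if the native library failed to load.
  def fib(_n), do: :erlang.nif_error(:nif_not_loaded)
  def phi_formula(_n), do: :erlang.nif_error(:nif_not_loaded)
  def fib_bignums(_n), do: :erlang.nif_error(:nif_not_loaded)
end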
This is the code we will be using in Rust,
#[rustler::nif]
fn fib(n: u128) -> u128 {
    // Iterative two-variable walk; assumes n >= 1 (fib(0) would return 1 here).
    let mut a = 0;
    let mut b = 1;
    let mut i = 1;
    while i < n {
        b = a + b; // b becomes F(i+1)
        a = b - a; // a recovers the old b, i.e. F(i)
        i += 1;
    }
    b
}
and for the $\phi$ based algorithm,
const PHI: f64 = 1.618033988749895;
fn inv_sqrt(x: f64) -> f64 {
let i = x.to_bits();
let x2 = x * 0.5;
let mut y = f64::from_bits(0x5FE6EB50C7B537A9 - (i >> 1));
y = y * (1.5 - x2 * y * y);
y = y * (1.5 - x2 * y * y);
y * (1.5 - x2 * y * y)
}
#[rustler::nif]
fn phi_formula(n: i32) -> f64 {
((PHI.powi(n) - (-PHI).powi(-n)) * inv_sqrt(5.0)).ceil()
}
Of course, the phi_formula implementation has the same issue as the Elixir one: since it is an approximation, it will not be precise at large scale.
Let's talk about the Rust implementation of Fibonacci. As you can see it is $O(n)$, and notice we are using just two variables instead of three: `b = a + b` advances the sequence and `a = b - a` recovers the previous term arithmetically (for example, from $(a, b) = (5, 8)$ we get $b = 13$ and then $a = 8$), and an arithmetic operation should be faster than shuffling a third value through a CPU register. Also notice that Rust forces us to declare types, and we are using u128 so we can compute and store really big integers.
Wait, something weird is happening
When I was testing the code for this article, I realized that the Rust implementation was sometimes inconsistent with the Elixir one, especially at large scale.
## Elixir impl with 100 input
iex(1)> FibRustElixir.fib(100)
354224848179261915075
## Rust impl with 100 input
iex(2)> TurboFibonacci.fib(100)
354224848179261915075
## Elixir impl with 200 input
iex(3)> FibRustElixir.fib(200)
280571172992510140037611932413038677189525
## Rust impl with 200 input
iex(4)> TurboFibonacci.fib(200)
178502649656846143791255889261670949781
Everything seems correct for 100, but things get weird for 200.
What the hell, how can the same algorithm give a different result?
The answer is called overflow. Since Rust works with fixed-size integers, the computation overflows for big numbers: somewhere around $n = 187$ the result no longer fits in a u128, and in release builds Rust wraps silently instead of panicking, which is why we get a wrong number rather than a crash.
Why doesn't Elixir have this problem? Because Elixir integers are dynamically sized. This creates a little overhead, since allocations are needed to make the integer fit, but it gives you an effectively "infinite" integer. I recommend reading Learning Elixir: Understanding Numbers to get a better understanding of what Elixir is doing.
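You can see this arbitrary-precision behaviour directly in iex; $2^{128}$ is already one past u128::MAX and Elixir does not blink:

iex> Integer.pow(2, 128)
340282366920938463463374607431768211456
iex> Integer.pow(2, 128) * Integer.pow(2, 128)
115792089237316195423570985008687907853269984665640564039457584007913129639936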
Then I wondered: what if I could get the same Elixir behaviour in Rust?
Thankfully, there are already super smart people in the Rust ecosystem who have thought about this problem, and there is already a crate for it: num_bigint. Even better, one of the rustler authors thought about this too and integrated rustler with num_bigint, giving you a translation layer for this kind of integer in Rust. Take a look at the Encoder impl for BigInt: it is the translation layer from a Rust BigInt to an Elixir term, which is an integer.
Let's take a look at the implementation with BigInt.
// Assumes the big-integer types are in scope, e.g.:
// use num_bigint::{BigInt, BigUint, Sign};
// use num_traits::{One, Zero};
#[rustler::nif]
fn fib_bignums(n: u128) -> BigInt {
    let mut a = BigUint::zero();
    let mut b = BigUint::one();
    let mut i = 1;
    while i < n {
        // Borrow both operands so neither BigUint is moved.
        b = &a + &b;
        a = &b - &a;
        i += 1;
    }
    BigInt::from_biguint(Sign::Plus, b)
}
As you can see nothing really changes; we are now using big integers, which are resizable and allow us to store massive values.
Let's try this implementation and see if it gives the same result as the Elixir one.
## Elixir with input 200
iex(3)> FibRustElixir.fib(200)
280571172992510140037611932413038677189525
## Rust `u128` with input 200
iex(4)> TurboFibonacci.fib(200)
178502649656846143791255889261670949781
## Rust resizable integers with input 200
iex(5)> TurboFibonacci.fib_bignums(200)
280571172992510140037611932413038677189525
YES! It works! Now we can compute and store massive Fibonacci numbers.
Theory is pretty cool, but show me the numbers.
Ok ok, now the moment you were waiting for: the benchmarks, where we see if the theory is confirmed.
The benchmark library I will be using is Benchee, which lets me compare implementations and obtain statistics for each one across different inputs.
I have divided the benchmarking in two: relatively small inputs, up to 100, and really big inputs, from 200_000 upwards.
These are the specs of my system:
Operating System: Linux
CPU Information: 13th Gen Intel(R) Core(TM) i5-1335U
Number of Available Cores: 12
Available memory: 15.30 GB
Elixir 1.18.4
Erlang 27.3.4.3
JIT enabled: true
This is the setup for low inputs.
def bench() do
Benchee.run(
%{
"Elixir O(N) Algorithm" => &fib/1,
"Elixir O(2^n) Algorithm Memo" => &dp_slow_fib/1,
"Elixir O(1) Algorithm Phi formula" => &phi_formula/1,
"Rust O(1) Algorithm Phi formula" => &TurboFibonacci.phi_formula/1,
"Rust O(N) Algorithm" => &TurboFibonacci.fib/1,
"Rust O(N) Algorithm expanding nums" =>
&TurboFibonacci.fib_bignums/1
},
inputs: %{
"1" => 1,
"71" => 71,
"100" => 100
},
parallel: 2,
after_scenario: fn _input ->
if :ets.whereis(:fibs) != :undefined do
:ets.delete(:fibs)
end
end
)
nil
end
Before you look at the results, you need to consider that the Rust expanding nums implementation was run on Erlang's dirty scheduler; without it, it is still the worst because of its overhead, but by less, around 33x slower. Also consider the $\phi$ formula implementations invalid above input 71 (Rust) and above input 75 (Elixir), because they stop giving precise values after those points.
As you can see here, the Rust result for input 71 is correct but the one for 72 is not; the Elixir $\phi$ implementation breaks the same way, just after 75.
### Input 71
iex(2)> TurboFibonacci.phi_formula(71)
308061521170129.0
iex(3)> TurboFibonacci.fib(71)
308061521170129
### Input 72
iex(4)> TurboFibonacci.phi_formula(72)
498454011879263.0
iex(5)> FibRustElixir.phi_formula(72)
498454011879264
iex(6)> TurboFibonacci.fib(72)
498454011879264
### Input 75
iex(12)> FibRustElixir.phi_formula(75)
2111485077978050
iex(13)> TurboFibonacci.fib(75)
2111485077978050
### Input 76
iex(14)> FibRustElixir.phi_formula(76)
3416454622906708
iex(15)> TurboFibonacci.fib(76)
3416454622906707
Here are the results of the low inputs benchmark.
##### With input 1 #####
Name ips average deviation median 99th %
Elixir O(N) Algorithm 16.38 M 61.07 ns ±45774.27% 44 ns 71 ns
Rust O(1) Algorithm Phi formula 9.15 M 109.26 ns ±4221.28% 100 ns 210 ns
Rust O(N) Algorithm 8.26 M 121.09 ns ±23343.76% 100 ns 203 ns
Elixir O(1) Algorithm Phi formula 3.31 M 302.31 ns ±136.52% 260 ns 531 ns
Elixir O(2^n) Algorithm Memo 1.19 M 838.18 ns ±5624.56% 787 ns 1101 ns
Rust O(N) Algorithm expanding nums 0.188 M 5308.94 ns ±84.14% 4395 ns 16176.06 ns
Comparison:
Elixir O(N) Algorithm 16.38 M
Rust O(1) Algorithm Phi formula 9.15 M - 1.79x slower +48.19 ns
Rust O(N) Algorithm 8.26 M - 1.98x slower +60.03 ns
Elixir O(1) Algorithm Phi formula 3.31 M - 4.95x slower +241.25 ns
Elixir O(2^n) Algorithm Memo 1.19 M - 13.73x slower +777.11 ns
Rust O(N) Algorithm expanding nums 0.188 M - 86.93x slower +5247.87 ns
##### With input 100 #####
Name ips average deviation median 99th %
Rust O(1) Algorithm Phi formula 8.09 M 123.63 ns ±263.70% 113 ns 195 ns
Rust O(N) Algorithm 4.63 M 215.80 ns ±356.51% 198 ns 419 ns
Elixir O(1) Algorithm Phi formula 3.15 M 317.82 ns ±209.58% 277 ns 615 ns
Elixir O(2^n) Algorithm Memo 1.45 M 687.71 ns ±8165.87% 587 ns 924 ns
Elixir O(N) Algorithm 0.66 M 1517.15 ns ±2172.62% 1380 ns 1928 ns
Rust O(N) Algorithm expanding nums 0.0784 M 12753.08 ns ±55.31% 13905 ns 24389 ns
Comparison:
Rust O(1) Algorithm Phi formula 8.09 M
Rust O(N) Algorithm 4.63 M - 1.75x slower +92.17 ns
Elixir O(1) Algorithm Phi formula 3.15 M - 2.57x slower +194.18 ns
Elixir O(2^n) Algorithm Memo 1.45 M - 5.56x slower +564.08 ns
Elixir O(N) Algorithm 0.66 M - 12.27x slower +1393.51 ns
Rust O(N) Algorithm expanding nums 0.0784 M - 103.15x slower +12629.45 ns
##### With input 71 #####
Name ips average deviation median 99th %
Rust O(1) Algorithm Phi formula 7.75 M 128.97 ns ±4535.63% 117 ns 230 ns
Rust O(N) Algorithm 6.54 M 152.88 ns ±19215.97% 123 ns 224 ns
Elixir O(1) Algorithm Phi formula 3.40 M 294.01 ns ±339.67% 258 ns 533 ns
Elixir O(2^n) Algorithm Memo 1.43 M 700.42 ns ±9294.91% 581 ns 904 ns
Elixir O(N) Algorithm 1.15 M 869.80 ns ±184.80% 853 ns 1259 ns
Rust O(N) Algorithm expanding nums 0.0940 M 10633.12 ns ±73.98% 10948 ns 20597 ns
Comparison:
Rust O(1) Algorithm Phi formula 7.75 M
Rust O(N) Algorithm 6.54 M - 1.19x slower +23.91 ns
Elixir O(1) Algorithm Phi formula 3.40 M - 2.28x slower +165.04 ns
Elixir O(2^n) Algorithm Memo 1.43 M - 5.43x slower +571.45 ns
Elixir O(N) Algorithm 1.15 M - 6.74x slower +740.83 ns
Rust O(N) Algorithm expanding nums 0.0940 M - 82.44x slower +10504.15 ns
---
config:
themeVariables:
xyChart:
plotColorPalette: '#B7410E, #7C6D91'
---
xychart
title "Low numbers benchmark input 1"
x-axis "Algorithm" ["Rust Phi", "Elixir Phi", "Rust O(N)", "Elixir O(N)", "Elixir O(2^N) Memo" ]
y-axis "Time in nanoseconds" 1 --> 1200
bar [210, -1000, 203 ,-1000, -1000]
bar [-1000, 531, -1000 , 71, 1101]
---
config:
themeVariables:
xyChart:
plotColorPalette: '#B7410E, #7C6D91'
---
xychart
title "Low numbers benchmark input 71"
x-axis "Algorithm" ["Rust Phi", "Elixir Phi", "Rust O(N)", "Elixir O(N)", "Elixir O(2^N) Memo" ]
y-axis "Time in nanoseconds" 1 --> 1300
bar [230, -1000, 224 ,-1000, -1000]
bar [-1000, 533, -1000 , 1259, 904]
---
config:
themeVariables:
xyChart:
plotColorPalette: '#B7410E, #7C6D91'
---
xychart
title "Low numbers benchmark input 100"
x-axis "Algorithm" ["Rust O(N)", "Elixir O(N)", "Elixir O(2^N) Memo" ]
y-axis "Time in nanoseconds" 1 --> 2000
bar [419,-1000, -1000]
bar [-1000 , 1928, 924]
The conclusion we can make here is that for really small inputs the Rust NIF is worth using, even though Elixir wins for input 1, since that case is a single function clause with almost no overhead. The $\phi$ algorithm is not so much faster at low inputs that it is worth using. Notice that the Rust expandable integers implementation has a very big overhead which small inputs do not compensate for.
This is the setup I used for big numbers.
As you can see we have much bigger inputs here; the other functions cannot be included in this benchmark because they would lose precision, take too much time, or take too much memory.
def bench_large_numbers() do
Benchee.run(
%{
"Elixir O(N) Algorithm" => &fib/1,
"Rust O(N) Algorithm expanding nums" =>
&TurboFibonacci.fib_bignums/1
},
inputs: %{
"200_000" => 200_000,
"500_000" => 500_000,
"1 Million" => 1_000_000,
"2 Million" => 2_000_000,
},
parallel: 2
)
end
These are the results for the big numbers benchmark.
##### With input 1 Million #####
Name ips average deviation median 99th %
Rust O(N) Algorithm expanding nums 0.0890 11.23 s ±0.09% 11.23 s 11.24 s
Elixir O(N) Algorithm 0.0798 12.54 s ±0.00% 12.54 s 12.54 s
Comparison:
Rust O(N) Algorithm expanding nums 0.0890
Elixir O(N) Algorithm 0.0798 - 1.12x slower +1.30 s
##### With input 2 Million #####
Name ips average deviation median 99th %
Elixir O(N) Algorithm 0.0238 0.70 min ±2.11% 0.70 min 0.71 min
Rust O(N) Algorithm expanding nums 0.0101 1.64 min ±0.01% 1.64 min 1.64 min
Comparison:
Elixir O(N) Algorithm 0.0238
Rust O(N) Algorithm expanding nums 0.0101 - 2.35x slower +0.94 min
##### With input 200_000 #####
Name ips average deviation median 99th %
Rust O(N) Algorithm expanding nums 3.04 328.71 ms ±0.56% 328.21 ms 333.74 ms
Elixir O(N) Algorithm 1.50 668.81 ms ±5.29% 659.64 ms 728.12 ms
Comparison:
Rust O(N) Algorithm expanding nums 3.04
Elixir O(N) Algorithm 1.50 - 2.03x slower +340.10 ms
##### With input 500_000 #####
Name ips average deviation median 99th %
Rust O(N) Algorithm expanding nums 0.44 2.25 s ±0.14% 2.25 s 2.25 s
Elixir O(N) Algorithm 0.23 4.33 s ±2.59% 4.33 s 4.43 s
Comparison:
Rust O(N) Algorithm expanding nums 0.44
Elixir O(N) Algorithm 0.23 - 1.93x slower +2.09 s
---
config:
themeVariables:
xyChart:
plotColorPalette: '#B7410E, #7C6D91'
---
xychart
title "Big numbers benchmark"
x-axis "Algorithm and input" ["E(200k)", "R(200k)", "E(500k)", "R(500k)", "E(1M)", "R(1M)" ]
y-axis "Time in seconds" 0 --> 15
bar [-1, 0.33374, -1 , 2.25, -1, 11.24]
bar [0.72812, -1, 4.43 ,-1, 12.54, -1]
---
config:
themeVariables:
xyChart:
plotColorPalette: '#B7410E, #7C6D91'
---
xychart
title "Big numbers benchmark (2 million)"
x-axis "Algorithm and input" ["E(2M)", "R(2M)"]
y-axis "Time in seconds" 1 --> 100
bar [ -1 , 98.4 ]
bar [42.6 ,-1 ]
This is a really interesting result: the Rust expanding nums implementation pays off its overhead, beating Elixir on every input except 2 million. I believe Elixir wins at the 2 million input because of some sort of reallocation optimization that the Rust code lacks.
Conclusion
For every peak performance algorithm you should analyze your inputs and make correct assumptions, like input <= 71, so that you can benefit from the approximation algorithm.
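A sketch of what that dispatch could look like; the threshold comes from this article's experiments, and the wrapper module name is illustrative:

defmodule SmartFib do
  # Below this input, Binet's formula rounds to the exact value (see the tests above).
  @phi_precision_limit 71

  def fib(n) when n >= 0 and n <= @phi_precision_limit, do: FibRustElixir.phi_formula(n)
  def fib(n) when n >= 0, do: TurboFibonacci.fib_bignums(n)
end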
If your inputs are super large, consider using native code instead of VM code; it seems to pay off. In fact Elixir is kind of cheating here, because Erlang is probably calling some C code under the hood to compute these kinds of numbers.
Benchmark! Always test your code with stress, load, and speed benchmarks; this way you can ensure your algorithm behaves the way you expect.
Of course there are many more strategies I did not cover, like multi-threading, dividing the work into chunks, and so on...
Hopefully you liked the article. All the code from the article is available on GitHub.