Tuesday, May 14, 2013

Python vs Perl vs Lua - speed comparison

I have written the base64 encoding algorithm in Perl, Python and Lua to compare the performance in basic arithmetic and file-io operations. Perl implementation was written a long time ago and it supposed to be a one liner. That's why the script looks so obfuscated:
alex@thinkpad:~$ cat bin/base64.pl
#!/usr/bin/env perl

$/=\1;
@a=();$b=0;

sub c2 {p(@_[0]>>2);p(((@_[0]&3)<<4)+((@_[1]&240)>>4))}
sub c3 {c2(@_);p(((@_[1]&15)<<2)+((@_[2]&192)>>6))}
sub c {c3(@_);p(@_[2]&63)}
sub p {
  if($_[0]<26){$n=ord("A")+$_[0]}
  elsif($_[0]<52){$n=ord("a")+$_[0]-26}
  elsif($_[0]<62){$n=ord("0")+$_[0]-52}
  elsif($_[0]==62){$n=ord("+")}
  else{$n=ord("/")}
  print chr($n);
  $b+=1;
  if($b%76==0){print "\n"}
}

while(<>) {
  push(@a,ord($_));
  if(3==@a) {
    c(@a);
    @a=()
  }
}

if(1==@a) {
  c2(@a[0],0);
  print "=="
}
elsif(2==@a) {
  c3(@a[0],@a[1],0);
  print "="
}
if($b>0){print "\n"}
 
Python and Lua implementations are just rewrites of the Perl version. The difference is that they both can only read from the stdin whereas the Perl version can read from the stdin and also from files specified on the command line.

Python version:
alex@thinkpad:~$ cat bin/base64.py
#!/usr/bin/env python

import sys
a=[]
b=0

def p(a):
  global b
  if a<26:
    n=ord("A")+a
  elif a<52:
    n=ord("a")+a-26
  elif a<62:
    n=ord("0")+a-52
  elif a==62:
    n=ord("+")
  else:
    n=ord("/")
  sys.stdout.write(chr(n));
  b+=1;
  if b%76 is 0:
    print
  return
def c2(a):
  p(a[0]>>2);
  p(((a[0]&3)<<4)+((a[1]&240)>>4));
  return
def c3(a):
  c2(a);
  p(((a[1]&15)<<2)+((a[2]&192)>>6));
  return
def c(a):
  c3(a);
  p(a[2]&63);
  return

try:
  if sys.version_info < (3,0):
    stream=sys.stdin
  else:
    stream=sys.stdin.detach()

  while True:
    s = stream.read(1)
    if not s:
      break; # eof condition
    a.append(ord(s))
    if len(a)==3:
      c(a);
      a=[]

  if 1==len(a):
    c2([a[0],0,0]);
    sys.stdout.write("==");
  elif 2==len(a):
    c3([a[0],a[1],0]);
    sys.stdout.write("=");
  if b is not 0:
    print
except KeyboardInterrupt:
  sys.stdout.flush()
  pass
 

and the Lua version:
alex@thinkpad:~$ cat bin/base64.lua
#!/usr/bin/env lua

function p(a)
  if a<26 then
    n=string.byte("A")+a
  elseif a<52 then
    n=string.byte("a")+a-26
  elseif a<62 then
    n=string.byte("0")+a-52
  elseif a==62 then
    n=string.byte("+")
  else
    n=string.byte("/")
  end
  io.write(string.char(n))
  b=b+1
  if (b%76)==0 then
    io.write("\n")
  end
end

function c2(a)
  p(bit32.rshift(a[1],2))
  p(bit32.lshift(bit32.band(a[1],3),4)+bit32.rshift(bit32.band(a[2],240),4))
end

function c3(a)
  c2(a)
  p(bit32.lshift(bit32.band(a[2],15),2)+bit32.rshift(bit32.band(a[3],192),6))
end

function c(a)
  c3(a)
  p(bit32.band(a[3],63))
end

a={}
b=0

while true do
  local s = io.read(1)
  if s == nil then break end 
  table.insert(a,string.byte(s))
  if 3==#a then
    c(a) 
    a={}
  end
end

if 1==#a then
  c2({a[1],0,0});
  io.write("==")
elseif 2==#a then
  c3({a[1],a[2],0});
  io.write("=")
end
if b > 0 then
  io.write("\n")
end
 

First of all tests to check if implementations are correct:
alex@thinkpad:~$ dd if=/dev/urandom bs=1M count=10 of=/tmp/10M
 
alex@thinkpad:~$ diff -s <(base64 /tmp/10M) <(bin/base64.pl /tmp/10M)
Files /dev/fd/63 and /dev/fd/62 are identical
 
alex@thinkpad:~$ diff -s <(base64 /tmp/10M) <(cat /tmp/10M|bin/base64.py)
Files /dev/fd/63 and /dev/fd/62 are identical
 
alex@thinkpad:~$ diff -s <(base64 /tmp/10M) <(cat /tmp/10M|bin/base64.lua)
Files /dev/fd/63 and /dev/fd/62 are identical

I should note that Python script haven't passed the test with the Python3 interpreter and that's why I skipped its results. There must be some problem with detaching the stream and this is the only way I know how to solve unicode problem with the stdin.read() function.

And here are the results of the measuring:
alex@thinkpad:~$ dd if=/dev/urandom bs=1M count=10 | bin/base64.pl > /dev/null
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 13.9454 s, 752 kB/s
 
alex@thinkpad:~$ dd if=/dev/urandom bs=1M count=10 | bin/base64.py > /dev/null
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 19.1276 s, 548 kB/s
 
alex@thinkpad:~$ dd if=/dev/urandom bs=1M count=10 | bin/base64.lua > /dev/null
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 20.941 s, 501 kB/s
 
alex@thinkpad:~$ dd if=/dev/urandom bs=1M count=10 | base64 > /dev/null
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.808636 s, 13.0 MB/s
 
alex@thinkpad:~$ dd if=/dev/urandom bs=1M count=10 | openssl base64 > /dev/null
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.765342 s, 13.7 MB/s

[Image]

Conclusion: Perl version is the fastest one, Python2 version is second and the Lua version is the slowest one! Still Perl version is more than 10 times slower than compiled C version

Versions of interpreters:
alex@thinkpad:~$ python
Python 2.7.4 (default, Apr 19 2013, 18:28:01) 
[GCC 4.7.3] on linux2
 
alex@thinkpad:~$ perl --version
This is perl 5, version 14, subversion 2 (v5.14.2) built for x86_64-linux-gnu-thread-multi
 
alex@thinkpad:~$ lua -v
Lua 5.2.1  Copyright (C) 1994-2012 Lua.org, PUC-Rio


Continue to the Part 2
Continue to the Part 3

15 comments:

Anonymous said...

Hi Alex. I understand it's hard to find a good piece of code to use in a benchmark, but you picked one that almost guaranteed that Lua will lose. Lua internally represents ALL numbers as floating point:

http://lua-users.org/wiki/NumbersTutorial

Therefore, any test that performs thousands of bit operations will *definitely* run very slowly in Lua.

I'm not trying to defend Lua; I'm just saying that this is a very poor indicator of its speed. Pick a different test and you may see very different results.

All the best,
Mike L.

Anonymous said...

Anonymous got a point.
Plus, could you test and provide your results with LuaJIT ? It may be significantly faster than the standard interpreter.

ZA said...

I do not doubt that Lua is a great language, but as an old timer I have my reservation about languages that do any of these things:
a. all numbers are (at least by default) floating point (as if integers are something of lower value (examples: Lua, Javascript, PL/I)
b. everything is an object (as if primitive do not exist) (examples: Ruby)
Thus, C, C#, Java, Perl and Python (and even COBOL) are still the norm, with the exception of Javascript
ZA

Unknown said...

Hi

I ran the benchmark on my virtual machine and I saw that you violated (harsh word, I know) one of the Lua principles:

Make variable access as local as possible.

Any call of bit32.rshift (et others) makes Lua do a lookup in the global table and find the functions in the sub tables. Because there are so many calls to these functions that gets expensive. All I did add this to the top of the file:


local lshift,rshift,band = bit32.lshift,bit32.rshift,bit32.band
local char,byte = string.char,string.byte
local insert = table.insert
local write,read=io.write,io.read

and replace all calls to bit32.band with band and all others respectively.



Here is the rundown. b64.lua has the local variables base64.lua doesn't:

$ dd if=/dev/urandom bs=1M count=10 | lua base64.lua > /dev/null
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 22.7633 s, 461 kB/s

$ dd if=/dev/urandom bs=1M count=10 | lua b64.lua > /dev/null
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 18.6338 s, 563 kB/s

That puts it in Python territory or actually better than), and it doesn't even optimize it for the Lua way of handling things. Bottom line I think is, one always should optimize the algorithm and more importantly data structures for the language.

Anonymous said...

As Tobas noted, global lookups are exceedingly slow and will kill timing. Localize all variables and pass around...like so:

local stage4 = function(a, iow, ti, sc, by, f, v1, v2, v3, v4, v5)
local n

if a < 26 then
n = v1 + a
elseif a < 52 then
n = v2 + a - 26
elseif a < 62 then
n = v3 + a - 52
elseif a == 62 then
n = v4
else
n = v5
end

iow(sc(n))
f = f + 1

if f % 76 == 0 then
iow("\n")
end
end

local stage3 = function(x, y, z, s2, s3, s4, rs, ls, ba, iow, ti, sc, by, f, v1, v2, v3, v4, v5)
s2(x, y, z, s3, s4, rs, ls, ba, iow, ti, sc, by, f, v1, v2, v3, v4, v5)
s4(ls(ba(y, 15), 2) + rs(ba(z, 192), 6), iow, ti, sc, by, f, v1, v2, v3, v4, v5)
end

local stage2 = function(x, y, z, s3, s4, rs, ls, ba, iow, ti, sc, by, f, v1, v2, v3, v4, v5)
s4(rs(x, 2), iow, ti, sc, by, f, v1, v2, v3, v4, v5)
s4(ls(ba(x, 3), 4) + rs(ba(y, 240), 4), iow, ti, sc, by, f, v1, v2, v3, v4, v5)
end

local stage1 = function(x, y, z, s2, s3, s4, rs, ls, ba, iow, ti, sc, by, f, v1, v2, v3, v4, v5)
s3(x, y, z, s2, s3, s4, rs, ls, ba, iow, ti, sc, by, f, v1, v2, v3, v4, v5)
s4(ba(z, 63), iow, ti, sc, by, f, v1, v2, v3, v4, v5)
end

local main = function()
local s1 = stage1
local s2 = stage2
local s3 = stage3
local s4 = stage4
local rs = bit32.rshift
local ls = bit32.lshift
local ba = bit32.band
local by = string.byte
local i = 0
local f = 0
local v1 = string.byte("A")
local v2 = string.byte("a")
local v3 = string.byte("0")
local v4 = string.byte("+")
local v5 = string.byte("/")
local fX, fY
local su = string.sub
local sc = string.char
local ti = table.insert
local iow = io.write

while true do
local x = io.read(1)
local y = io.read(1)
local z = io.read(1)

if x == nil or y == nil or z == nil then
fX = x
fY = y
break
end

s1(by(x), by(y), by(z), s2, s3, s4, rs, ls, ba, iow, ti, sc, by, f, v1, v2, v3, v4, v5)
end

if fX ~= nil then
if fY ~= nil then
s3(by(fX), by(fY), 0, s3, s4, rs, ls, ba, iow, ti, sc, by, f, v1, v2, v3, v4, v5)
iow("==")
else
s2(by(fX), 0, 0, s3, s4, rs, ls, ba, iow, ti, sc, by, f, v1, v2, v3, v4, v5)
iow("=")
end
end

if f > 0 then
iow("\n")
end
end

main()

That runs in about 8.5 seconds using luajit (using luajits bit library required and set to a variable named bit32) giving it 1.2MB/s.

alex said...

yeah, you've managed to turn a very simple routine into unreadable mess. Now tell me how it's going to work with a bigger project? What's about maintainability? You don't actually think that these tests are only about silly numbers, do you?

Unknown said...

Hello.
i really interested in your information. i'm searching for script language for studying myselft. then i saw your article. thank you for give me some information. :)

i have some question. can you tell me what is your computer spec?
(at this comparison)
i really hope to know about it.
plz send me e-mail : ysm901124@gmail.com

thank you. god bless you

Thomas Martin Klein said...

Also memory-time tradeoff is also something you might want to do.

Mike L:
Your are right about the numbers, so let's not use them :)

Ok Alex's implementation on my 1.8Ghz i3 does this:

[niki@NN210A]$ dd if=/dev/urandom bs=1M count=10 | lua b64_o.lua > /dev/null
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 43.9401 s, 239 kB/s

Which means bytheway. That he has atleast a 2.5 Ghz Processor :)

Ok. My implementation does this:
[niki@NN210A]$ dd if=/dev/urandom bs=1M count=10 | lua b64.lua > /dev/null
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 10.9582 s, 957 kB/s

But it is an entirely different aproach. And it uses a 12bit lookup table to avoid all the math. It also user 1.6MB of RAM (this one will be used on a router, so i wouldnt use much more than that). What it does.. it encodes the read block in hex, then maps each 3 HEX character to 2 base64 chars.

Here is the code. Its a mess, as i am planing to further develop it, so you can actually set what level of tradeoff to have. (i am also quite new to lua)

#!/usr/bin/lua
# (c) Thomas Martin Klein 2013

local b64table = {
'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P',
'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f',
'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
'w', 'x', 'y', 'z', '0', '1', '2', '3',
'4', '5', '6', '7', '8', '9', '+', '/'
}

local block = 384
local fstr = string.rep('%02X',block);
local csize = 3 --12bit
local csc = 2^(4*csize)
local mfloor = math.floor
local sformat = string.format
local ssub = string.sub
local lut = {}
for i=0,csc-1 do
local indx = sformat("%03X",i)
local big = mfloor(i/64)+1
local small = (i%64)+1
--print(indx..' '.. b64table[big] ..b64table[small])
lut[indx]=b64table[big]..b64table[small];
end

while true do
local bytes = io.read(block)
if not bytes then break end
local bc = #bytes
if bc<block then
fstr = string.rep('%02X',bc);
end
local hexstring = sformat(fstr,bytes:byte(1,bc))
local b64string = ""
for i=1,bc*2,3 do
key = ssub(hexstring,i,i+2)
if #key == 3 then
b64string = b64string .. lut[key]
elseif #key == 2 then
local number = tonumber(key,16)
local top = mfloor(number/4)+1
local bottom = (number%4)*4+1;
b64string = b64string .. b64table[top] .. b64table[bottom] .. "=="
elseif #key == 1 then
local number = tonumber(key,16)*4+1
b64string = b64string .. b64table[number] .. "="
end
end
io.write(b64string)
end

My actiual points are. That when you use a certain language, you have to think inside that. (i am sure there is still room for improovment in my code to). And that you do not have to discard memory as a usable resource (except if you do not have it in a situation )

Anonymous said...

What's the point of a performance comparison of terribly inefficient code? You're reading files one byte at a time, repeatedly making the same function calls without caching results, and creating lots of extra objects.

If you're concerned with efficiency, then you need to start with efficient code. There's no use in comparing bad code in one language to bad code in another language. Shame on Google for placing your post so high my search results.

alex said...

My good anonymous friend doesn't seem to understand the sole purpose of this testing. The code needs to be inefficient to see how different engines deal with it.

varun sharma said...

And i thought lua was fastest...

xaver said...

Interestingly, the upcoming Lua version 5.3 will have integers as well as bitwise operators, which will improve readability and performance.

The above code can then be written as:

function p(a)
if a<26 then
n=string.byte("A")+a
elseif a<52 then
n=string.byte("a")+a-26
elseif a<62 then
n=string.byte("0")+a-52
elseif a==62 then
n=string.byte("+")
else
n=string.byte("/")
end
io.write(string.char(n))
b=b+1
if (b%76)==0 then
io.write("\n")
end
end

function c2(a)
p(a[1]>>2)
p(((a[1]&3)<<4)+((a[2]&240)>>4))
end

function c3(a)
c2(a)
p(((a[2]&15)<<2)+((a[3]&192)>>6))
end

function c(a)
c3(a)
p((a[3]&63))
end

a={}
b=0

while true do
local s = io.read(1)
if s == nil then break end
table.insert(a,string.byte(s))
if 3==#a then
c(a)
a={}
end
end

if 1==#a then
c2({a[1],0,0});
io.write("==")
elseif 2==#a then
c3({a[1],a[2],0});
io.write("=")
end
if b > 0 then
io.write("\n")
end

marc said...

Sorry Alex, but I think Anonymous hit the nail on the head. What is this comparison actually trying to achieve? Are you trying to say that perl gets the best performance when directly transcribing c code? If so, that's probably true. Are you tring to say that lua performance is mediocre? If so, you first need to rewrite the lua code so that it runs efficiently. Your code doesn't.

Using (http: // stackoverflow.com /questions/342409/how-do-i-base64-encode-decode-in-c) as the reference c code, I have to say that c is clearly king as far as readability goes. Nothing comes close. Compared to this, perl is readable but polluted with sigils, python is verbose and lua might be readable on a good day. But there is nothing stopping us from writing lua that almost exactly mimics the c.

Anonymous said...

well, as an additional point...@Alex, I ran the Lua code you provided on LuaJIT (instead of the Lua interpreter), and I got 1.3 MB/s in 8 seconds.

Python was 400 kB/s in 26 seconds.

Then, using the implementation provided by Anonymous, the code got 1.6 MB/s in 6.6 seconds.

So, LuaJIT is still blazingly fast.

Grundoko said...

I know this post is a bit old, but I re-ran your tests using Perl, Python3, Python2, Lua5.1, Lua5.2, and LuaJIT. My results were quite different from yours.

Perl 16.6879 628kB/s
Python3 29.0687 361kB/s
Python2 26.1504 401kB/s
Lua5.1 18.0574 581kB/s
Lua5.2 20.6259 508kB/s
LuaJIT 8.58338 1.2MB/s

Without any modification to the code, LuaJIT runs the code in half the time of Perl. It seems at least in this case, the state of Lua has strongly improved compared to Perl. At least with the LuaJIT interpreter. Though Lua5.1 runs it almost as fast as Perl.