Call/Operations Cost in CPU Ticks

Ruby Call/Operation Cost in CPU Ticks

Ruby is my language of choice for general purpose programming, for its expressive power and clarity. The dynamic and generic object structure of Ruby makes it a good match for BSON and MongoDB, likewise of Python, Perl, JavaScript, etc. But the following questions are raised.

  1. What is the cost for Ruby with dynamic object data structures versus C++ with static structures?
  2. JRuby has improved performance. How do Ruby 2.0.0 YARV and JRuby 1.7.3 compare?
  3. How do we improve real-world performance?

These rounded numbers express the relative cost of operations and calls in CPU ticks. Please see the graphs below for actual measured numbers. The following are from the Intel Sandy Bridge numbers (Mac mini mid 2011).

Operation C++
Measured-Rounded
Ruby 2.0.0 Measured-Rounded JRuby 1.7.3 Measured-Rounded
branch / loop 0 110 100
intXor 1 80 50
function call 1 70 50
stack object 2
fwrite 50 610 390
fread 50 4300 460
malloc + free 270
new heap object 300 450 100
write 1500 4600 2,900
read 1500 11,000 2,700
recv | send 5000
fork + wait 200,000
system 450,000
map/hash word count 330 880 540
map/hash int to string 1,400 1,100 410
  1. Ruby common primitive operations like looping, arithmetic, and method call appear to be in the ball-park of 100 instructions, and along with IO can often have an order of magnitude or more cost for Ruby over C++. For intensive math, consider using C/C++ or Matlab.
  2. New object overhead is expensive for complex (non-simple) objects. See below for a table of expressions and objects and their associated allocation. Memory allocation is significant even with optimizations like many Ruby object slots per malloc. Optimize expression to minimize the allocation of complex objects.
  3. High-order structures like maps / hashes, lists, etc. involve malloc overhead, both in Ruby and C++. While Ruby is still significantly less efficient in general than C++, the difference for high-order operations is substantially less than the difference for common primitive operations. More complex "real-world" implementations may be even closer. The "map/hash word count" measurements show relatively smaller overhead factors for Ruby/JRuby. The "map/hash int to string" measurements actually favor Ruby and especially JRuby. See below for more detail.
  4. JRuby shows significant performance benefit over Ruby 2.0 even with the perfomance improvements from YARV as evident in these measurements. Especially note the reduced cost for a new heap object and see below for more detail. Java 7 brings with it an important new feature called invokedynamic, which greatly improves JRuby's performance on VMs that support it. However, the use of invokedynamic is off by default on Java 7. Measurements with invokedynamic are pending.
  5. For real-world performance, the answer is still the same. Write your code in the cleanest way possible, and then profile it to identify opportunities for optimization. Run A-B benchmarks of your optimized code to measure any gain and to determine acceptance, and iterate. Use the cost above to remind yourself to pay attention to IO, memory allocation, and loops. Lastly, consider a C or Java extension, but as minimal as possible for manageable maintenance.

Here's the link to C/C++ Call/Operation Cost in CPU Ticks measurements, analysis, and notes.

Note: C++ STL map is traditionally an ordered tree-based map, while hash_map and unordered_map are hash-based maps. In these tests, I used map because it has been official STL for a long time, as hash_map is not official and unordered_map is more recent than G++ 4.2. Also, hash_map requires the programmer to supply an additional include, namespace, and a hash function, and the simple hash function using string::c_str() is not efficient.

Expressions, New Objects, and Allocation - Ruby VALUE Immediate Objects

Various expressions were measured for object allocation by disabling the garbage collector (GC), collecting GC stats, running a repeated test expression, enabling and running GC, collecting GC stats again, calculating the difference and the objects allocated per iteration of the expression.

Object Class Expression Objects Allocated
FalseClass false || false 0
Fixnum 1 + 2 0
Float 1.0 + 2.0 0
NilClass nil 0
Symbol :my_symbol 0
TrueClass true && true 0
Array Array.new 1
Bignum 1 << 64 1
Class Class.new 2
Exception Exception.new 1
Hash Hash.new 1
Method method(:exit) 1
Module Module.new 1
Numeric Numeric.new 1
Object Object.new 1
Proc Proc.new{} 2
Range Range.new(1,3) 1
Regexp Regexp.new(".") 4
String String.new 1
Struct Struct.new(:x) 3
Thread Thread.new{} 7
Time Time.new 1

In Ruby, all variables are references, e.g., pointers to an object. Knowing that pointers are word aligned with low-bits set to zero, Ruby stores some built-in types with small-data sizes as an immediate object directly in the Ruby VALUE pointer, namely Fixnum, Symbol, true, false, and nil. Therefore, the expressions above with zero objects allocated indicate the an immediate object. A least-significant bit set to one denotes a 63-bit number on a 64-bit architecture. Other low-order bit values specify the encoding of the other immediate objects. On 64-bit architectures, some floats that do not require the full 64 bits can be stored as an immediate object. Symbols expressions have no allocation because there is only one global instance (that is not subject to GC).

MRI and YARV Ruby reduce malloc overhead by allocating a large number of object slots in a single malloc. Objects slots are 40 bytes on 64-bit architectures allowing small objects to be self-contained in a slot. From the measurements, it is clear that JRuby optimizes allocation.

VALUE as an Immediate Object - Extending Ruby - Programming Ruby The Pragmatic Programmer's Guide

MRI Memory Allocation, A Primer For Developers

Simple Algorithms

map/hash word count

C++

map m;
for (size_t i = 0; i < iterations; i++) {
    stringstream ss(STRING_1024);
    string word;
    while (ss >> word) {
        m[word] += 1;
    }
}

Ruby

h = Hash.new(0)
iterations.times do
    STRING_1024.split.each do |w|
        h[w] += 1
    end
end

map/hash int to string

C++

for (size_t i = 0; i < iterations; i++) {
    map m;
    stringstream ss;
    for (int j = 0; j < VECTOR_SIZE; j++) {
        ss.seekg(0);
        ss << j;
        m[j] = ss.str();
    }
}

Ruby

iterations.times do
    h = Hash.new
    VECTOR_SIZE.times do |j|
        h[j] = j.to_s
    end
end

Graphs

Sandy Bridge - Instruction Decode and uop Cache - for some clue about sub-CPU-tick measurements due to micro-ops.

mongo-performance

C++ test_call.cpp

Ruby test_call.rb