Call/Operation Cost in CPU Ticks

Even experienced programmers need a reminder about the cost of system calls.

These rounded numbers express the relative cost of operations and calls in CPU ticks measured on Intel Core 2 processors. Please see the graphs below for actual measured numbers.

Operation	Measured-Rounded
branch / loop	0
intXor	1
function call	1
stack object	2
memcpy 512 bytes	50
fread \| fwrite	50
malloc + free	400
new heap object	400
read \| write	2,500
recv \| send	5,000
fork + wait \| system	300,000

Shell (fork + wait | system) commands are really expensive. Use them only for convenience where performance is not an issue.
Socket calls are very expensive - minimize them to one per message.
Mallocs are expensive - minimize them. Resource (buffer) pooling is highly recommended.
Heap objects are expensive due to malloc. Use very inexpensive stack objects for performance and automatic memory management.
Buffered IO can be a big performance win.
Function calls and loops are inexpensive on a modern processor.

Profiling is recommended for performance tuning - see also cachegrind and kcachegrind.

Programming Efficiency Concerns

Real world example problems in the C driver

mongo-c-driver - mongo_read_response()

The above sample has several flaws in function mongo_read_response.

Socket-read-call return values are not checked for two occurrences.
Socket timeout are missed, the function continues with the wire protocol out-of-sync, resulting in garbage data, and the users program crashes. If the error code is checked and an error returned from the function, then the user can take action like closing and reconnecting.
There are three socket read calls which should be reduced to the minimum of one.
Socket calls are expensive, on the order of 5000 CPU ticks as seen in the table below, the most expensive system call that is commonly invoked. In the example above, the first two socket reads can be easily combined into one read, and the data can be copied if needed with close to no cost relative to the socket read cost.

It's most probable that the MongoDB server will send the complete response as a single send. It's most probable that the driver can get the complete response in a single recv. A simple solution would be to do single recv request into a buffer of max_message_size. A more sophisticated solution would be to use elastic buffers. Do a first recv to the current buffer size. From the message header, you now have the length that you need to know, and you can grow the buffer and recv the remainder of the message only if needed.

If the server sends more than one response as a single send and if a single recv gets more than one response, then the next obvious solution is to use stream input/output. Investigate fdopen and remember dup and flush(es) are probably needed, or you can roll your own stream IO. If you write your BSON decoder to use stream IO, you may even avoid intermediate buffers. Multithreading and large bulk 16MB documents introduce complications, but I'm certain that smart people can come up with smart solutions - that's why we are needed.

The cost of socket calls is highly significant. Note that they induce system CPU time and the cost will not be reflected in user CPU time. If you have performance requirements, please start your investigating improvements here!
An expensive and unbalanced malloc is forced, there's no possibility of pooled buffer resources externally.
The sample function forces a malloc on each call. While malloc/free is not as expensive as an IO system call, it is next with a cost of around 400 CPU ticks. Many-many programs are dominated by malloc overhead for user CPU time.

This is an unbalanced malloc which is problematic. A standard practice for correctness is to have the malloc and its matching free co-located in the same function or module to avoid memory allocation problems like memory leaks. This sample function forces the user to have a call to free this memory.

A better solution would be to not malloc internally supply buffer resources to the function call. This has the possibility of zero malloc overhead per response-read call. The buffers can be pooled and allocated once at startup or perhaps elastically on demand. This essentially eliminates malloc overhead for the buffers needed for reading a response.

Lessons re-learned

Check all socket call return values and handle all errors appropriately.
Reduce socket calls to a minimum, one per message.
Enable pooled (buffer) resources, avoid unnecessary mallocs, especially unbalanced mallocs.

Graphs

Comparison

Comparison Table - includes all processors measured above.

Sandy Bridge - Instruction Decode and uop Cache - for some clue about sub-CPU-tick measurements due to micro-ops.

mongo-performance

test_call.cpp