Recon Application

Copyright © 2012-2014 Fred Hebert (BSD 3-Clause License)

Authors: Fred Hebert.

Recon is a library to be dropped into any other Erlang project, to help DevOps people diagnose problems in production nodes.

The source code can be obtained from the GitHub repository.

Included modules are:

Main module, contains basic functionality to interact with the recon application. It includes functions to gather information about processes and the general state of the virtual machine, ports, and OTP behaviours running in the node. It also includes a few functions to facilitate RPC calls with distributed Erlang nodes.
Groups functions that deal with Erlang's memory allocators, in particular to present the allocator data in a way that makes it simpler to discover possible problems.
Groups useful functionality used by Recon when dealing with data from the node. An interesting place to look if you want to extend Recon's functionality.
Provides production-safe tracing facilities, to dig into the execution of programs and function calls as they are running.

This library contains few tests -- most of the functionality has been tried directly in production instead, and for many Erlang installs, Recon functionality should be safe to use directly in production, assuming there is still memory left to be used in the node.

To help with regular DevOps tasks, a variety of scripts has also been included in the repository's script/ directory:

Escript that relies on graphviz, and produces a dependency graph of all applications in the repository. The script can be run directly from an Erlang shell (if compiled), or as escript app_deps.erl.
Bash script to run on an Erlang crash dump as ./ <crashdump>. It extracts generic information that can be useful in determining the most common causes of node failure.
Awk script to run on an Erlang crash dump as awk -v threshold=<queue size> -f queue_fun.awk <crashdump>. It shows which function processes with queue sizes larger than or equal to <queue size> were executing at the time of the crash dump. This may help find out whether most processes were stuck blocking on a given function call while accumulating messages forever.
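
To illustrate the idea behind the queue-size script, here is a minimal sketch. It is not the bundled queue_fun.awk. It assumes that each process section in a crash dump starts with a line beginning with =proc: and contains "Message queue length:" and "Current call:" lines. The helper name queue_calls, the sample dump, and the module names in it are hypothetical and heavily abbreviated.

```shell
# Hypothetical helper, not part of Recon: print "<queue length> <current call>"
# for every process whose message queue length is >= the given threshold.
# $1 = threshold, $2 = path to a crash dump
queue_calls() {
    awk -v threshold="$1" '
        # Report the process collected so far (if it qualifies), then reset.
        function emit() {
            if (in_proc && len >= threshold + 0 && call != "")
                print len, call
            in_proc = 0; len = -1; call = ""
        }
        /^=/      { emit() }        # any section header ends the previous section
        /^=proc:/ { in_proc = 1 }   # a process section begins
        /^Message queue length: / { if (in_proc) len = $4 + 0 }
        /^Current call: /         { if (in_proc) call = $3 }
        END { emit() }
    ' "$2"
}

# Synthetic, abbreviated sample (assumed layout, for illustration only).
dump=$(mktemp)
cat > "$dump" <<'EOF'
=proc:<0.10.0>
State: Waiting
Current call: my_db:query/2
Message queue length: 1500
=proc:<0.11.0>
State: Waiting
Current call: my_db:query/2
Message queue length: 3
=proc:<0.12.0>
State: Waiting
Current call: my_cache:lookup/1
Message queue length: 200
=port:#Port<0.1>
EOF

result=$(queue_calls 100 "$dump")
printf '%s\n' "$result"
rm -f "$dump"
```

With the sample above and a threshold of 100, the sketch prints "1500 my_db:query/2" and "200 my_cache:lookup/1", skipping the process with only 3 queued messages. The sketch only reads "Current call:" lines, so processes that a dump describes with a different field are not reported.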