Morph
This post is the introduction to a multi-part blog series on C/C++ audio application design and optimisation.


“Morphing” two audio sources together is a common sound design process, ranging from simply fading from one source to another to transplanting the spectral qualities of sound A onto sound B. In the vein of software such as S-Layer, the aim of this application, Morph, is to provide a batch DSP processing tool that can quickly transform a large set of sounds by another set, producing output files that can either be processed further or saved as source files for future morphing. These DSP procedures will range from simple linear crossfading to cross-synthesis and spectral interpolation. The potentially huge data rates and CPU intensive operations such as FFT analysis make this application an ideal candidate for systematic optimisation, particularly single threaded optimisation and SIMD operations. The aim of this blog series, then, is to chronicle the steps I’ve taken to polish Morph from an initial slow prototype into something much more time and memory efficient, while retaining a user friendly interface familiar to sound designers.

Everything from compiling to profiling is driven by Bash scripts, largely for simplicity and portability. gcc is the compiler of choice for optimised builds, although this may change to clang in the near future (because every source code version is archived, regenerating performance data from clang rebuilds won't be an issue further down the line). The specific flags required for each build are:

-g – Generates source level debug info that will be required by profilers like perf and valgrind

-O3 – Optimise to the highest degree available at the cost of higher compile times

-march=native – Build for this machine’s CPU architecture rather than a minimum baseline

To ensure consistency across every benchmark, some constants were chosen for both the input files and how they are processed: every audio file is a 32-bit, 192kHz stereo WAV file ranging from 2 to 80 seconds, as are all the output files. Each benchmark run must generate 50 output files (meaning 100 source files are read and processed). The target platform is an Intel Core i5 3570K @ 3.4GHz with 8GB of RAM, running Ubuntu 18.04 from an SSD.

$2 ./morph "./data/BOOM_A" "./data/BOOM_B" 50 ${Output50} 1 0.5 0.9
The benchmark.sh script demonstrates how the program is called for each run

A suite of command line tools was chosen to give a broad sense of the program’s performance beyond just cycles spent per operation:

time – Outputs the time spent from calling the program until its completion. Given that the aim is to process as many audio files as quickly as possible, time gives a good indication of whether or not optimisations are making a significant difference.

strace – Identifies any system calls made by the program along with their execution times. As Morph hinges on file reads and writes and large memory allocations, strace will be useful in determining just how many system calls are truly necessary.

perf stat – As part of the performance analysis toolkit for Linux, stat records hardware performance counters such as total instructions and cache misses. While valgrind can provide similar information in a more granular fashion, stat makes it easy to see at a glance whether one version of the application is running leaner than another (a higher instructions per cycle count, for example).

perf record – Similar to stat, although it works at the function and source line level. When combined with perf report, it gives a clear indication of which functions are taking the most time to complete, and will be useful in determining whether too much time is being spent in system calls or standard library functions rather than the actual DSP codebase.

valgrind – Tracks memory allocations and their matching frees to ensure there are no memory leaks. With the potentially large memory footprint of Morph, ensuring its memory handling is stable and as small as possible will be a vital concern.

cachegrind – One of Valgrind’s tools, cachegrind counts both instruction fetches and low level cache accesses. Crucially, it tracks cache misses and branch mispredictions, which will highlight hotspot areas in the code that need further optimisation.

Each profiling run saves the output of these tools to plaintext files, which are backed up in a timestamped archive that can be cross-referenced by the diff.sh script.

# Launch meld
meld -a -n --label="Time"   ./${SourceA}/time_${SourceA}.txt            ./${SourceB}/time_${SourceB}.txt            &
meld -a -n --label="Trace"  ./${SourceA}/trace_${SourceA}.txt           ./${SourceB}/trace_${SourceB}.txt           &
meld -a -n --label="Stat"   ./${SourceA}/stat_${SourceA}.txt            ./${SourceB}/stat_${SourceB}.txt            &
meld -a -n --label="Memory" ./${SourceA}/memory_${SourceA}.txt          ./${SourceB}/memory_${SourceB}.txt          &
meld -a -n --label="Cache"  ./${SourceA}/cache_report_${SourceA}.txt    ./${SourceB}/cache_report_${SourceB}.txt    &
Using Meld as a text difference viewer for each benchmark result

The first version of Morph focuses solely on crossfading, for the simple reason that the architecture supporting more advanced spectral processing in the future will rely on a foundation that needs to be rock solid. Keeping things simple at first will help optimise the core functions of Morph and highlight general usability grievances.

After translating the command line arguments, the first stage of Morph is to process the two input file directories. I made some initial assumptions about the data layout of the application to aid future optimisation, namely the adoption of Structure of Arrays (SoA) over the conventional Array of Structures (AoS) design philosophy. SoA assumes that any time you want to access one piece of data that would normally be packed into a struct/class, you also want to access that same field from many other instances, so it becomes more efficient to store each element side by side in memory. Because this application processes every input and output file in one go, the SoA approach seems beneficial, although further testing will be needed to determine both the value of the SoA approach and how the structs are being packed by the compiler.

typedef struct MORPH_SOURCES
{
    size_t      Count;                           // Number of sources currently loaded
    size_t      WAVID[MAX_SOURCES];              // ID of the WAV file each source was loaded from
    u64         TotalSampleLength[MAX_SOURCES];  // Length of each source in samples
    f32         *Data[MAX_SOURCES];              // Raw 32-bit float sample data for each source
} MORPH_SOURCES;

Header file definition of a source, where each data element is arranged array by array (with a counter to keep track).
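
For contrast, here is what a conventional AoS layout would look like (purely illustrative, this isn't in the codebase). A loop that only touches sample lengths would then stride over the unrelated fields, dragging them through the cache for nothing:

// Hypothetical AoS equivalent, for comparison only - not used by Morph
typedef struct MORPH_SOURCE
{
    size_t  WAVID;             // Each struct interleaves its fields in memory,
    u64     TotalSampleLength; // so iterating over one field pulls the others
    f32     *Data;             // into the cache as well
} MORPH_SOURCE;

MORPH_SOURCE Sources[MAX_SOURCES];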


The ParseDirectories function relies on a lot of string manipulation to build the file paths, as well as using the DIR struct to loop through the directory tree supplied on the command line. String handling is often a performance hit, so even without profiling this function presents itself as a likely hotspot.

while((DirectoryAParser = readdir(DirectoryA)) && (DirectoryBParser = readdir(DirectoryB))) 
{
    // Skip the "." and ".." directory entries
    // From A
    if(!strcmp(DirectoryAParser->d_name, "."))
        continue;
    if(!strcmp(DirectoryAParser->d_name, ".."))    
        continue;

    // Directory A route
    // Concatenate strings to build the full file path
    FullPathA[0] = '\0'; // Reset buffer for each loop so strcat starts from an empty string
    strcat(FullPathA, PathA);
    strcat(FullPathA, Slash);
    strcat(FullPathA, DirectoryAParser->d_name);

    // Load into database
    LoadWAVIntoSource(FullPathA, Sources, WAVs);

    // Repeat for Directory B
    ...

}
Directory handling code that loops through every file in an input path and loads it into memory
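
As an aside for later optimisation passes, the three strcat calls could likely be collapsed into a single snprintf (assuming FullPathA is a fixed-size char array), though whether that is actually cheaper is exactly the kind of question the profilers are for:

// Hypothetical single-call replacement for the strcat chain above
snprintf(FullPathA, sizeof(FullPathA), "%s/%s", PathA, DirectoryAParser->d_name);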

Each WAV file is read and allocated in full as a 32-bit float array, which makes it easy to process at the cost of a huge memory footprint when working at 192kHz.

// Allocate
Sources->Data[Index] = (f32 *) malloc(sizeof(f32) * Sources->TotalSampleLength[Index]);

// Copy sample data
fread(Sources->Data[Index] , sizeof(f32), Sources->TotalSampleLength[Index], File);

Allocating and reading the whole WAV file
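
For context, TotalSampleLength is derived from the WAV header parsed earlier in LoadWAVIntoSource. The snippet below is a minimal sketch of that step, assuming a canonical 44-byte header with no extension chunks, and reusing the project's fixed-width typedefs (u16, u32):

// Sketch only: canonical 44-byte WAV header, no extension chunks
typedef struct WAV_HEADER
{
    char RIFFID[4];        // "RIFF"
    u32  FileSize;         // Total file size minus 8 bytes
    char WAVEID[4];        // "WAVE"
    char FormatID[4];      // "fmt "
    u32  FormatSize;       // Size of the format chunk (16 for PCM)
    u16  FormatTag;        // 3 = IEEE float
    u16  Channels;         // 2 for stereo
    u32  SampleRate;       // 192000
    u32  ByteRate;         // SampleRate * Channels * BytesPerSample
    u16  BlockAlign;       // Channels * BytesPerSample
    u16  BitsPerSample;    // 32
    char DataID[4];        // "data"
    u32  DataSize;         // Size of the sample data in bytes
} WAV_HEADER;

WAV_HEADER Header;
fread(&Header, sizeof(WAV_HEADER), 1, File);
Sources->TotalSampleLength[Index] = Header.DataSize / sizeof(f32);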

These buffers then have to be reallocated according to whichever source in the A-B pairing is bigger, invoking another round of system calls. Investigating whether reading and allocating in smaller blocks would be worthwhile is a future task in the effort to reduce memory usage.

for(size_t i = 0; i < Sources->Count; i += 2)
{
    // Source A is bigger - realloc Source B
    if(Sources->TotalSampleLength[i] > Sources->TotalSampleLength[i+1])
    {
        u64 Padding = Sources->TotalSampleLength[i] - Sources->TotalSampleLength[i+1];
        u64 NewSize = Sources->TotalSampleLength[i+1] + Padding;
        Sources->Data[i+1] = (f32 *) realloc(Sources->Data[i+1], sizeof(f32) * NewSize);

        // realloc leaves the padded tail uninitialised - zero it so the
        // crossfade reads silence rather than garbage
        memset(&Sources->Data[i+1][Sources->TotalSampleLength[i+1]], 0, sizeof(f32) * Padding);
    }
    
    // Source B is bigger - realloc Source A
    else
    ...
}

Looping through each source pair and checking whether reallocations need to be made

With the data buffers ready, the application loops through the sources in pairs, processing each pair according to the DSP type (a switch statement) and storing the result in an output buffer. The actual maths for a linear crossfade is straightforward, combining the summation of two audio signals with a linear interpolation, where t is the “power” of either source A or B (0.5 for equal volume, 0.0 for only source A, 1.0 for only source B):

y[n] = a[n] * (1 - t) + b[n] * t

Because the loop applies the same crossfade parameter (t) to every sample, it is a prime candidate for parallelisation (SIMD). WriteOutputToWAV then simply takes this output buffer and writes it to the ‘data’ chunk of an empty WAV file along with the header.
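
For context, the OutputCount and OutputBuffer used in the loop below are set up per pair. A minimal sketch, assuming the output simply spans the longer source of the pair:

// Sketch only: the output of a pair spans the longer of its two sources
u64 OutputCount = (Sources->TotalSampleLength[Index] > Sources->TotalSampleLength[Index+1])
                ? Sources->TotalSampleLength[Index]
                : Sources->TotalSampleLength[Index+1];

f32 *OutputBuffer = (f32 *) malloc(sizeof(f32) * OutputCount);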

for(u64 j = 0; j < OutputCount; ++j)
{
    // Linear crossfade: source B scaled by t, source A by (1 - t)
    f32 Sample      = Sources->Data[Index+1][j] * Parameter + Sources->Data[Index][j] * (1.0f - Parameter);
    Sample          *= OutputAmplitude;
    OutputBuffer[j] = Sample;
}

Calculating the crossfade amplitude for each output sample
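
To make that SIMD opportunity concrete ahead of Part 2, here is a rough sketch of what an SSE version of the same loop might look like, using the standard xmmintrin intrinsics with unaligned loads and a scalar tail for leftover samples (an illustration, not the shipped code):

#include <xmmintrin.h>

// Broadcast the loop-invariant parameters across all four SSE lanes
__m128 T    = _mm_set1_ps(Parameter);
__m128 OneT = _mm_set1_ps(1.0f - Parameter);
__m128 Amp  = _mm_set1_ps(OutputAmplitude);

u64 j = 0;
for(; j + 4 <= OutputCount; j += 4)
{
    __m128 A = _mm_loadu_ps(&Sources->Data[Index][j]);
    __m128 B = _mm_loadu_ps(&Sources->Data[Index+1][j]);

    // y = (b * t + a * (1 - t)) * amplitude, four samples at a time
    __m128 Y = _mm_add_ps(_mm_mul_ps(B, T), _mm_mul_ps(A, OneT));
    _mm_storeu_ps(&OutputBuffer[j], _mm_mul_ps(Y, Amp));
}

// Scalar tail for when OutputCount isn't a multiple of 4
for(; j < OutputCount; ++j)
    OutputBuffer[j] = (Sources->Data[Index+1][j] * Parameter
                    + Sources->Data[Index][j] * (1.0f - Parameter)) * OutputAmplitude;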

A highlight of the version 1 results is as follows.

The full result set is available here. Part 2 will focus on general purpose optimisation techniques highlighted by this first stage of benchmarking, setting the stage for SIMD operations.