Introduction
Many Csound users may have seen me referring to the SDFT, or Sliding Discrete Fourier Transform, in recent months. In this article I will introduce the basic ideas and describe the Csound implementation. The main message is that it is integrated with Richard Dobson's f-signal framework and so is easy to use. The downside is that it is slow, and definitely not for real-time performance.

The work is the product of a research project carried out by Russell Bradford (a mathematician), Richard Dobson (a musician) and myself (a hacker), and was funded largely by the Arts and Humanities Research Council under their Speculative Research theme, to whom we express our thanks.
I. The SDFT
The use of the spectral domain is well known and there is no reason to
reiterate the underlying concepts. Most people use an algorithm
called the FFT (Fast Fourier Transform) to calculate the Discrete
Fourier transform of a window of audio signal, and then perform
whatever operation on it they wish. The reconstruction of the
time-domain signal is then performed by the IFFT (Inverse Fast Fourier
Transform) algorithm, together with some averaging process.
The SDFT is an alternative algorithm, whereby the analysis window is
always moved on by one sample, and the value of the DFT for the new
window is calculated from the previous one. That such a process
should exist is not a surprise, as the FFT calculates the spectral
components in a window, and if that window is moved (slid) by one
sample, it is obvious that most of the information will remain the
same. Details can be found in many places, such as James Moorer's
"Audio in the New Millennium", JAES 48(5), 2000, or more recently in
Bradford, Dobson and ffitch, "Sliding is Smoother than Jumping",
pp. 287-290, ICMC 2005. The bottom line is that for each frequency
bin the process of moving on by one sample involves adding the new
sample, subtracting the sample that has fallen off the start of the
window, and a multiplication (rotation) by a complex value:
$$F_{t+1}(n) = \bigl(F_t(n) + x_{t+N} - x_t\bigr)\, e^{2\pi i n/N}$$

where t refers to time, n the bin number, N the window size, and x the input signal. We start from all zeros, so there is no starting problem for the first DFT.
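To make the per-bin update concrete, here is a minimal C sketch of one sliding step. This is illustrative only, not Csound's actual source; the function name sdft_step and the data layout are my own.

  /* One SDFT step: slide the analysis window on by one sample.
     bins[] holds the N complex bin values F_t(n); "newest" is the sample
     entering the window, "oldest" the sample leaving it. */
  #include <complex.h>
  #include <math.h>

  void sdft_step(double complex *bins, int N, double newest, double oldest)
  {
      double delta = newest - oldest;
      for (int n = 0; n < N; n++) {
          /* add the new sample, subtract the old one, then rotate the bin */
          bins[n] = (bins[n] + delta) * cexp(2.0 * M_PI * I * (double)n / (double)N);
      }
  }

Each step costs O(N) complex multiplications, so the analysis costs O(N) work per sample rather than the O(N log N) per hop of the FFT approach; this is part of why the sliding form is slow overall.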
There are a number of implications of this formulation; most obviously, there is no requirement for the window size to be a power of two. Some of the more surprising implications can be found in Bradford, Dobson and ffitch's paper on the Sliding Phase Vocoder at ICMC 2007. Here I just draw attention to the bandwidth of each bin, which extends up to the Nyquist frequency, so each bin can contribute to any frequency.
But the point of this article is not to discuss the mathematics, but rather the Csound implementation, and the opcodes that can use it. Before that we do need to consider an important issue, windowing.
Windowing and Sliding
The normal process is to apply an envelope window to the samples to improve frequency resolution and minimise smearing. In the SDFT this cannot be done in the time domain. But it is well known that multiplication in the time domain is equivalent to convolution in the frequency domain, and that is what we do: the window is applied by convolving its spectrum with each frame. This does impose a slight restriction, that in practice the window must be built from a small number of cosine terms, so that the convolution only involves a few neighbouring bins. This is acceptable for rectangular, Hamming, von Hann, Blackman (and variants), Blackman-Harris, and Nuttall windows. It does not include Kaiser windows, which is a pity; we think we have a solution but it is a compromise, and it is currently not implemented.

We are also forced to lose the ability to provide a window as an ftable. Technically it would be possible to provide a user-window facility presented as coefficients of the cosines, but this does seem rather detailed for common use!
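Returning to the cosine-built windows, as a small worked example (assuming the periodic von Hann window $w_k = 0.5 - 0.5\cos(2\pi k/N)$ and the transform conventions used above), the time-domain multiplication turns into a three-term combination of neighbouring bins:

$$F_w(n) = 0.5\,F(n) - 0.25\,\bigl(F(n-1) + F(n+1)\bigr)$$

with the bin indices taken modulo N. Windows built from more cosine terms simply pull in a few more neighbouring bins, which is why the cosine-sum restriction keeps the cost manageable.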
II. Integration into Csound
Due to earlier work by Richard Dobson, later extended by Victor
Lazzarini, Csound has a useful and well constructed streaming phase
vocoder. The process here is to construct a new DFT frame when
sufficient samples have been obtained, with the restriction that this
cannot happen more than once per k-cycle. Internally the f-variables
have a structure to maintain the bin values, window size and type, and
various housekeeping data. The SDFT implementation in Csound reuses
this structure, so from an elementary point of view the introduction
of the sliding option involves no syntactic change. All existing uses of the
pvsXXX opcodes and f-variables should continue to work after the
introduction of sliding. Actually this is not totally true if small
hop sizes are used with small ksmps, but I hope this is a very
small problem. This is explained later.
In essence the SDFT is a transform with an overlap parameter of 1 sample. That, together with the restriction on the existing pvsanal, gives the clue to the way the integration has been done. If the overlap is less than ksmps, or is small (currently set to less than 10, but maybe this should be 2), the sliding format is used. This requires a transform frame for each of the ksmps samples in a k-cycle, and that happens underneath the user's view, as sketched below. The construction of the frames then proceeds using the formula above rather than the FFT.
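For readers who want a picture of what happens inside one k-cycle, the following rough C sketch, in the same illustrative spirit as before, produces one windowed frame per sample; sdft_step, window_frame, the ring buffer and the frames array are my own names, not Csound's internals.

  #include <complex.h>

  /* Assumed helpers from the earlier sketches (hypothetical names). */
  void sdft_step(double complex *bins, int N, double newest, double oldest);
  void window_frame(double complex *out, const double complex *bins, int N);

  /* One k-cycle of sliding analysis: ksmps frames, one per input sample. */
  void analyse_kcycle(double complex **frames, double complex *bins,
                      double *ringbuf, int *pos, int N,
                      const double *input, int ksmps)
  {
      for (int j = 0; j < ksmps; j++) {
          double oldest = ringbuf[*pos];       /* sample leaving the window  */
          double newest = input[j];            /* sample entering the window */
          ringbuf[*pos] = newest;
          *pos = (*pos + 1) % N;
          sdft_step(bins, N, newest, oldest);  /* per-bin update shown above */
          window_frame(frames[j], bins, N);    /* frequency-domain windowing */
      }
  }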
The reconstruction of the signal is rather different. In the original
Moorer paper in JAES the inverse algorithm is very memory intensive,
while looking deceptively similar to the forward transform.
Fortunately there is a better way that Russell Bradford noticed. As
we have a transform for each sample we can treat each frame as a
representation of a single sample, and then the fundamental Discrete
Fourier Transform formula

$$x_t = \frac{1}{N}\sum_{n=0}^{N-1} F(n)\, e^{2\pi i t n/N}$$

applies. There is a choice we can make; if each frame represents one
sample, which one? If we choose the oldest sample then the t in this
formula is zero, and the equation collapses to a straight addition of
the bins! This does introduce a latency of the window size, of
course. By considering the middle sample to be represented by the
frame, this becomes an alternating add/subtract formula. A zero
latency version is more expensive, and while we have had it
implemented in our experimental code, it seems rather too much.
It should not come as a surprise that such a simple formula exists, as it is just treating the reconstruction as an oscillator bank. In the Csound SDFT implementation of pvsynth we are using the midpoint version. We could make an option for alternatives but that seems a little over-complex.
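To make the two cheap choices concrete, using the inverse formula above (and assuming N is even for the midpoint case):

$$x_{\text{oldest}} = \frac{1}{N}\sum_{n=0}^{N-1} F(n), \qquad x_{\text{middle}} = \frac{1}{N}\sum_{n=0}^{N-1} (-1)^n F(n)$$

since $e^{2\pi i (N/2) n/N} = (-1)^n$. The first choice carries a latency of the full window, the second roughly half of it.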
Csound Opcodes
The opcode to create a sliding DFT frame is pvsanal, as before. The arguments are not changed, but if the overlap parameter is 1 or less than ksmps then sliding is used. In the manual the example given is

  ain  in                             ; live source
  ffin pvsanal ain,1024,256,2048,0    ; analyze, using Hamming

If this is replaced by

  ain  in                             ; live source
  ffin pvsanal ain,1024,1,2048,0      ; analyze, using Hamming

the sliding mechanism will be used, and the internal structures will be different.
To reconstruct the signal the opcode pvsynth is used; if the f-signal is a sliding one then the appropriate mechanism is used.
III. Transforming in the Spectral Domain
Just translating to and from the spectral domain is not very
interesting. The value lies in the transformations. Csound has a
rich (and growing) collection of these, and they all need to be made
aware of the sliding format. In addition this opens the possibility
for a-rate changes in some cases. In this section I indicate the
progress so far. There are three classes: those known to
work, those believed to work but untested, and those not done.
Verified Transformations
A very common use of the spectral domain is to change the pitch of a signal. Csound provides two opcodes for this, pvscale which scales the frequency components in a harmonic way, and pvshift which shifts the frequency components, stretching/compressing its spectrum. Both of these are implemented with sliding, but with one important additional feature: the shift amount (the second argument in both cases) can optionally be an a-rate variable. This gives much finer-grained control over the transformation, while a k-rate value is still accepted as elsewhere.

  instr 3
    al   line    400, p3, 500
    asig in
    fsig pvsanal asig, 128, 1, 128, 1
    fs   pvshift fsig, al, 10
    acln pvsynth fs
         out     acln
  endin

  instr 4
    asig oscili  16000, 440, 1
    fsig pvsanal asig, 512, 1, 512, 0
    fs   pvscale fsig, 1.1
    acln pvsynth fs
         out     acln
  endin
Another common use of the spectral domain is the freeze. Csound provides pvsfreeze, which can also be used in a sliding fashion, but is otherwise unchanged.
  instr 1
    kl    line      100, p3, 1000
    asig  oscili    16000, kl, 1
    fsig  pvsanal   asig, 128, 1, 128, 1
    ktrig oscil     1.5, 2, 1            ; trigger
    ktrig =         abs(ktrig)
    fou   pvsfreeze fsig, ktrig, ktrig   ; regular 'freeze' of spectra
    aa    pvsynth   fou
          out       aa
  endin

There are different ways of filtering in the spectral domain. The pvstencil opcode provides one such transformation, using a masking function table. For example
  instr 2
    kl   line      100, p3, 1000
    asig oscili    16000, kl, 1
    fsig pvsanal   asig, 128, 1, 128, 1
    fcln pvstencil fsig, 0, 1, 1
    acln pvsynth   fcln
         out       acln
  endin

with "f1 0 4096 10 1" in the score.
The last of the verified opcodes is pvsmix, which does a spectral mix of two signals, taking the larger amplitude from each matching pair of bins.
  instr 5
    a2   diskin2 "latedemo.wav", 1
    f2   pvsanal a2, 1000, 1, 1000, 1
    asig in
    fsig pvsanal asig, 1000, 1, 1000, 1
    fs   pvsmix  fsig, f2
    acln pvsynth fs
         out     acln
  endin
Converted but Untested Transformations
It is only lack of time that has prevented the full checking of these opcodes.

Another filtering opcode, pvsfilter, filters one stream according to a second. The code for this is included in this version, as is the cross-synthesis opcode pvscross.
The pvscent opcode, which calculates the spectral centroid of a signal, has been changed to return either a k-rate answer or an a-rate one.
pvsbin could return a-rate values, but so far it is limited to returning the values at the start of each k-cycle.
The opcodes pvsosc, pvsmaska, pvsblur, pvsinit and pvsmooth have unverified code that may work.
pvsinfo, the information opcode, has not been changed, but should work the same. It does not state explicitly that the f-signal is sliding, but this can be deduced from the overlap value.
Non-functional Transformations
The remaining opcodes fall into two major categories: those which have just not yet been looked at, and those that raise major design questions. The first category includes pvsdemix, pvsmorph, pvspitch, pvsifd, and pvsarp.
The problematic opcodes are those that read and write, as the data from the sliding variant is much larger. Reading externally generated PVOC-EX data will not work either, unless that is generated at 1 sample overlap, and the utility to do that needs to be written. The opcodes here are pvsdiskin, pvsftr, pvsftw, pvsfread, pvsin, pvsout, pvsdisp, pvsfwrite, pvsvoc, pvsbuffer, and pvsbufread.
And I just do not know what should be done with pvsadsyn as we are already using an oscillator-bank to reconstruct.
IV. Some Thoughts
Clearly this is work in progress. The untested opcodes will be tested
soon, and the opcodes I failed to notice initially will get looked
at. What to do about read/write and external data needs some
reflection and advice.
There is also one important caveat. This process is not fast. I will repeat that: THIS PROCESS IS NOT FAST, and so do not expect to use it in real-time performance, at least until hardware improves and our next research project is funded.
More technical information about the process can be found in the slides presented at ICMC2007 which are available here with audio examples.
Acknowledgements
Thanks to the Arts and Humanities Research Council for their confidence in our work, and for their funds. The work described here is the result of collaboration between myself, Russell Bradford (a mathematician) and Richard Dobson (a musician). The words here are mine, but the insights and inspirations of my collaborators cannot be overestimated.

I would also like to thank all the members of the Csound community who, knowingly or otherwise, have encouraged me in this area. I will mention especially Victor Lazzarini and Rick Boulanger in this regard, who both had a larger part than they realise.