Lazy Physics
… in which we explore lazy sequences and common functional idioms in Clojure via the example of looking for (nearly-)coincident clusters of times in a series.
A fundamental technical problem in experimental particle physics is how to distinguish the signatures of particles from instrumental noise.
Imagine a tree full of hundreds of sparrows, each nesting on a branch, each chirping away occasionally. Suddenly, for a brief moment, they all start chirping vigorously (maybe a hawk flew past). A clustering of chirps in time is the signal that something has happened! The analogous situation occurs in instruments consisting of many similar detector elements, each generating some amount of random noise that, on its own, is indistinguishable from any evidence left by particles, but which, taken together, signals that, again, something has happened – a muon, an electron, a neutrino has left a sudden spume of electronic evidence in your instrument, waiting to be read out and distinguished from the endless noise.
This process of separating the noise from the signal is known in physics as triggering and is typically done through some combination of spatial or time clustering; in many cases, time is the simplest to handle and the first “line of defense” against being overrun by too much data. (It is often impractical to consume all the data generated by all the elements – data reduction is the name of the game at most stages of these experiments.)
This data is typically generated continously ad infinitum, and must therefore be processed differently than, say, a single file on disk. Such infinite sequences of data are an excellent fit for the functional pattern known as laziness, in which, rather than chewing up all your RAM and/or hard disk space, data is consumed and transformed only as needed / as available. This kind of processing is baked into Clojure at many levels and throughout its library of core functions, dozens of which can be combined (“composed”) to serve an endless variety of data transformations. (This style of data wrangling is also available in Python via generators and functional libraries such as Toolz.)
Prompted by a recent question on the topic from a physicist and former colleague, I got to thinking about the classic problem of triggering, and realized that the time series trigger provides a nice showcase for Clojure’s core library and for processing lazy sequences. The rest of this post will describe a simple trigger, essentially what particle astrophysicists I know call a “simple majority trigger”; or a “simple multiplicity trigger” (depending on whom you talk to).
Now for some Clojure code. (A small amount of familiarity with
Clojure’s simple syntax is recommended for maximum understanding of
what follows.) We will build up our understanding through a series of
successively more complex code snippets. The exposition follows
closely what one might do in the Clojure REPL, building up
successively more complete examples. In each
case, we use take
to limit what would otherwise be infinite
sequences of data (so that our examples can terminate without keeping
us waiting forever…).
First we create a sorted, infinite series of ever-increasing times (in, say, nsec):
times
is an infinite (but “unrealized”) series, constructed by
iterating the anonymous function #(+ % (rand-int 1000))
which adds a
random integer from 0 to 999 to its argument (starting with zero).
The fact that it is infinite does not prevent us from defining it or
(gingerly) interrogating it via take
.
(To model a Poisson process – one in which any given event time is independent of the future or past times – one would normally choose an exponential rather than a uniformly flat distribution of time differences, but this is not important for our discussion so in the interest of simplicity we’ll go with what we have here.)
Now, the way we’ll look for excesses is to look for groupings of hits
(say, eight of them) whose first and last hit times are within 1 µsec
(1000 nsec) of each other. To start, there is a handy function called
partition
which groups a series in blocks of fixed length:
We’ll rewrite this using Clojure’s thread-last macro, which is a very helpful tool for rewriting nested expressions as a more readable pipeline of successive function applications:
However, this isn’t quite what we want, because it won’t find
clusters of times close together who don’t happen to begin on our
partition boundaries. To fix this, we use the optional step
argument to partition
:
This is getting closer to what we want–if you look carefully, you’ll
see that each row consists of the previous one shifted by one
element. The next step is to grab (via map
) the first and last times
of each group, using juxt
to apply both first
and last
to
each subsequence…
… and turn these into time differences:
Note that so far these time differences are all > 1000. comp
,
above, turns a collection of multiple functions into a new function
which is the composition of these functions, applied successively one
after the other (right-to-left). partial
turns a function of
multiple arguments into a function of fewer arguments, by binding one
or more of the arguments in a new function. For example,
Recall that we only want events whose times are close to each other;
say, whose duration is under a maximum limit of 1000 nsec. In
general, to select only the elements of a sequence which satisfy a
filter function, we use filter
:
((partial > 1000)
is a function of one argument which returns true if that argument is strictly less than 1000.)
Great! We now have total “durations”; for subsequences of 8 times, where the total durations are less than 1000 nsec.
But this is not actually that helpful. It would be better if we could get both the total durations and the actual subsequences satisfying the requirement (the analog of this in a real physics experiment would be returning the actual hit data falling inside the trigger window).
To do this, juxt
once again comes to the rescue, by allowing us
to juxt
-apose the original data along side the total duration to
show both together…
… and adapt our filter slightly to apply our filter only to the time rather than the original data:
Finally, to turn this into a function for later use, use defn
and
remove take
:
smt-8
consumes one, potentially infinite sequence and outputs
another, “smaller” (but also potentially infinite) lazy sequence of
time-clusters-plus-durations, in the form shown above.
Some contemplation will suggest many variants; for example, one in which some number of hits outside the trigger “window”; are also included. This is left as an exercise for the advanced reader.
A “real” physics trigger would have to deal with many other details:
each hit, in addition to its time, would likely have an amplitude, a
sensor ID, and other data associated with it. Also, the data may not
be perfectly sorted, some sensors may drop out of the data stream,
etc. But in some sense this prototypical time clustering algorithm is
one of the fundamental building blocks of experimental high energy
physics and astrophysics and was used (in some variant) in every
experiment I worked on over a 25+ year period. The representation
above is certainly one of the most succinct, and shows off the power
and elegance of the language, its core library, and lazy sequences.
(It is also reasonably fast for such a simple algorithm; smt-8
consumes input times at a rate of about 250 kHz. This is not,
however, fast enough for an instrument like IceCube, whose 5160
sensors each count at a rate of roughly 300 Hz, for a total rate of
1.5 MHz. A future post may look at ways to get better performance.)
blog comments powered by Disqus