SALR clustering
5D scatter-point data
Load data
Compute data density
In this step the data is discretized to a grid (binned). The data is first binned so that the grid completely covers the data range. Using this information, the grid range is then limited so that the bins have at least density_threshold points. The data is re-binned using this limited range and smoothed with a slow but memory-efficient Gaussian filter. The grid is set to have approximately nbins bins along each dimension; the actual number of bins is chosen so that the grid has the same aspect ratio as the data.
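The binning steps described above can be sketched in plain Python. This is an illustrative sketch only, not the toolbox's implementation: the function names grid_shape, bin_points, and limit_range are hypothetical, the aspect-ratio rule (the geometric mean of the bin counts is about nbins) is an assumption, and limiting the range to the bounding box of sufficiently populated cells is one possible reading of the density threshold. The Gaussian smoothing step is not shown.

```python
import math
from collections import Counter

def grid_shape(mins, maxs, nbins):
    """Bins per dimension, proportional to the data range in that dimension.

    Assumption: the geometric mean of the bin counts is about `nbins`.
    """
    ranges = [hi - lo for lo, hi in zip(mins, maxs)]
    geo_mean = math.prod(ranges) ** (1 / len(ranges))
    return [max(1, round(nbins * r / geo_mean)) for r in ranges]

def bin_points(points, mins, maxs, shape):
    """Count points per grid cell; points outside [mins, maxs] are dropped."""
    counts = Counter()
    for p in points:
        idx = []
        for x, lo, hi, n in zip(p, mins, maxs, shape):
            if not lo <= x <= hi:
                break  # point lies outside the grid in this dimension
            # Map [lo, hi] onto cells 0..n-1; x == hi falls in the last cell.
            idx.append(min(int((x - lo) / (hi - lo) * n), n - 1))
        else:
            counts[tuple(idx)] += 1
    return counts

def limit_range(counts, mins, maxs, shape, density_threshold):
    """Shrink the range to the bounding box of cells that hold at least
    `density_threshold` points (an assumed interpretation)."""
    kept = [idx for idx, c in counts.items() if c >= density_threshold]
    new_mins, new_maxs = [], []
    for d, (lo, hi, n) in enumerate(zip(mins, maxs, shape)):
        width = (hi - lo) / n
        new_mins.append(lo + min(i[d] for i in kept) * width)
        new_maxs.append(lo + (max(i[d] for i in kept) + 1) * width)
    return new_mins, new_maxs

# Toy 2D data: one dense cell, two sparse cells, one point outside the grid.
points = [(0.5, 0.5), (1.5, 0.5), (1.6, 0.6), (1.7, 0.7), (3.9, 0.1), (5.0, 0.5)]
mins, maxs, shape = [0.0, 0.0], [4.0, 1.0], [4, 2]
counts = bin_points(points, mins, maxs, shape)
new_mins, new_maxs = limit_range(counts, mins, maxs, shape, density_threshold=2)
```

After limiting the range, the data would be binned a second time with bin_points using new_mins and new_maxs.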
Explore/Visualize the data
Plot isosurfaces of the data density (count). Since the data is 5D, it must be projected down to 3D. Here, dimensions 1, 3, and 4 are shown. Note that the data is very dense in the middle, with three low-density regions extending outwards (and one very low-density region).
Setup SALR clustering parameters
The parameters needed to run SALR clustering can all be accessed and set using the class seedPointOptions. This class handles parameter validation as well as computing and visualizing the SALR particle interaction parameters and potential.
Set the particle initialization parameters.
- Use a uniform random point distribution. This overlays a hyper-cubic lattice across the grid, with each lattice cell having a volume equal to the volume of a hyper-sphere with radius Wigner_Seitz_Radius. From each lattice cell, a point is then randomly selected from the grid where the confining potential is between Minimum_Initial_Potential and Maximum_Initial_Potential.
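To make the lattice geometry concrete, the following sketch computes the side length of a hyper-cubic cell whose volume equals that of a hyper-sphere of a given radius, and picks an initial point from one cell. The function names and the dictionary-based potential lookup are hypothetical and chosen for illustration; only the hyper-sphere volume formula is standard.

```python
import math
import random

def hypersphere_volume(radius, dim):
    """Volume of a dim-dimensional ball: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim

def lattice_spacing(radius, dim):
    """Side of the hyper-cube whose volume equals that of the hyper-sphere."""
    return hypersphere_volume(radius, dim) ** (1 / dim)

def pick_initial_point(cell_points, potential, v_min, v_max, rng):
    """Randomly pick one grid point of a lattice cell whose confining
    potential lies in [v_min, v_max]; return None if the cell has none."""
    allowed = [p for p in cell_points if v_min <= potential[p] <= v_max]
    return rng.choice(allowed) if allowed else None

# Lattice spacing for a 5D problem with a unit Wigner-Seitz radius.
spacing_5d = lattice_spacing(radius=1.0, dim=5)

# One toy lattice cell with two grid points; only (0, 0) is in range.
cell = [(0, 0), (0, 1)]
potential = {(0, 0): 0.5, (0, 1): 2.0}
start = pick_initial_point(cell, potential, 0.0, 1.0, random.Random(0))
```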
Set confining potential parameters.
- Use a confining potential based on the data density, and scale the confining force to be 0.4 at its 90% value.
Set the particle interaction parameter values.
- Note that the Potential_Parameters are given in data units.
- The attractive extent in the solver space is 12. Try changing this value: if you decrease it to about 10, you should see that the resulting seed-points are farther apart from each other and closer to the boundaries of the data. If you increase it to about 15, you should see that the seed-points get closer to each other and farther from the data boundaries.
- Use a Minkowski distance with an exponent of 4. This helps require that the particles be close to each other in all dimensions.
Set up replicates and minimum cluster size.
- Here we use 5 replicates and we keep any seed-point that at least 3 of the replicates produce.
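One way to picture the replicate rule is as a vote: seed-points from different replicates that land close together are grouped, and a group is kept only if enough distinct replicates contributed to it. The sketch below is an assumption about how such a vote could work, not the toolbox's method; the function name, the greedy grouping, and the tol distance are all hypothetical.

```python
import math

def consensus_seed_points(replicates, tol, min_count):
    """Group nearby seed-points across replicates and keep the mean of each
    group that contains points from at least `min_count` replicates."""
    groups = []  # each group is a list of (replicate_index, point)
    for r, points in enumerate(replicates):
        for p in points:
            for g in groups:
                centre = [sum(c) / len(g) for c in zip(*(q for _, q in g))]
                if math.dist(p, centre) <= tol:
                    g.append((r, p))
                    break
            else:
                groups.append([(r, p)])  # no nearby group: start a new one
    kept = []
    for g in groups:
        if len({r for r, _ in g}) >= min_count:
            kept.append(tuple(sum(c) / len(g) for c in zip(*(q for _, q in g))))
    return kept

# Five replicates: all find a point near (0, 0); only two find one near (10, 10).
replicates = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.1, 0.0), (10.1, 10.0)],
    [(0.0, 0.1)],
    [(-0.1, 0.0)],
    [(0.0, -0.1)],
]
final = consensus_seed_points(replicates, tol=1.0, min_count=3)
```

In this toy case only the group near the origin survives, because the second group was found by fewer than three replicates.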
Set parameters controlling execution.
- Verbose will output information on the current iteration and the expected time of completion.
- Debug will return extra information.
- Use_Parallel will determine if the iterations are run in parallel. Note each worker will need its own copy of the data. So, a computer with 4 cores needs 4 times as much memory. Unless you are running many iterations, it is likely faster to not use parallel computation due to overhead.
Compute seed points
The seed-points are computed by passing the binned data, the seedPointOptions, and the data limits to computeObjectSeedPoints.
Plot the results
Plot the final seed-points as large red dots and the seed-points from each replicate as small black dots. This can be done by creating a markers structure and passing it to the isosurfaceProjectionPlot function. It is also nice to project the seed-points onto the three axis planes; this can be done by setting a project field in the markers structure to true.
- Note: The seed-points returned by computeObjectSeedPoints are in data units. To plot them, we need to convert them back into grid units.
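The conversion between data units and grid units is a linear rescaling per dimension. The sketch below assumes that the grid spans the data limits exactly and that grid coordinates start at 0; both helper names are hypothetical, and the toolbox's own indexing convention (for example, MATLAB's 1-based indices) may differ by a constant offset.

```python
def data_to_grid(point, mins, maxs, shape):
    """Map a point in data units to fractional grid coordinates (0-based)."""
    return tuple(
        (x - lo) / (hi - lo) * n
        for x, lo, hi, n in zip(point, mins, maxs, shape)
    )

def grid_to_data(coord, mins, maxs, shape):
    """Inverse of data_to_grid: map grid coordinates back to data units."""
    return tuple(
        lo + g / n * (hi - lo)
        for g, lo, hi, n in zip(coord, mins, maxs, shape)
    )

# A 20 x 5 grid covering the data range [0, 4] x [0, 1].
mins, maxs, shape = [0.0, 0.0], [4.0, 1.0], [20, 5]
grid_coord = data_to_grid((2.0, 0.5), mins, maxs, shape)   # (10.0, 2.5)
round_trip = grid_to_data(grid_coord, mins, maxs, shape)   # (2.0, 0.5)
```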
Locate seed-points with k-means
Let’s compare the SALR clustering results to k-means. Use K=4 (since we would not expect the very small low-density region to be found) with 2 replicates, and then plot the results as large blue dots alongside the SALR clustering results.
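For readers unfamiliar with k-means replicates, the following is a minimal sketch of Lloyd's algorithm that runs several random initializations and keeps the one with the lowest within-cluster sum of squares. It is a generic illustration with a hypothetical function signature, not the routine used to produce the results in this example, and it runs on a toy data set rather than the 5D data.

```python
import math
import random

def kmeans(points, k, replicates=2, max_iter=100, seed=0):
    """Lloyd's algorithm; returns the sorted centres of the best replicate."""
    rng = random.Random(seed)
    best_centres, best_inertia = None, math.inf
    for _rep in range(replicates):
        centres = rng.sample(points, k)  # initialise from k distinct data points
        for _it in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: math.dist(p, centres[j]))
                clusters[nearest].append(p)
            # An empty cluster keeps its previous centre.
            new_centres = [
                tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centres[j]
                for j, cl in enumerate(clusters)
            ]
            if new_centres == centres:
                break  # assignments no longer change
            centres = new_centres
        # Within-cluster sum of squared distances for this replicate.
        inertia = sum(min(math.dist(p, c) ** 2 for c in centres) for p in points)
        if inertia < best_inertia:
            best_centres, best_inertia = centres, inertia
    return sorted(best_centres)

# Toy data: two well-separated pairs of points.
toy = [(0, 0), (1, 1), (10, 10), (11, 11)]
centres = kmeans(toy, k=2, replicates=2)
```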
Data set description
The data used in this example are features representing the amount of damage in the nuclei used throughout this work. Using immunofluorescence techniques, the cells were stained so that DNA DSBs can be directly imaged. These images were then segmented and features were extracted for each nucleus. The data set in this example has five features:
- The first feature is log(I_DSB/I_DAPI), where the ratio gives the fraction of DNA in a nucleus that has been damaged.
- The other four features are the first four principal components (PCA) of texture and granularity features from the DSB image channel.
The images showing DSBs are not provided with the example data of this work.
James Kapaldo EXAMPLES