Working with RStudio: multi-dimensional fitting

(This is sixth article in the series Using R for Gambet statistical analysis.[1]) RStudio is an integrated development environment (IDE) for working with R, similar to the IDEs available for most computer languages. Figure 1 shows a screenshot. There are work areas for the script editor, console and plots. The difference is that they are combined in a single convenient workspace rather than floating windows. RStudio is freely available and provides a wealth of features, so it is hard to imagine why anyone would choose not to use it. Here are some of the advantages:

  • RStudio remembers your screen setups, so it is not necessary to resize and reposition floating windows in every session.
  • The script editor (upper left in Fig. 1) features syntax highlighting and an integrated debugger. You can keep multiple tabbed documents active.
  • An additional area at upper-right shows all R objects current defined. Double clicking a data frame opens it in a data editor tab in the script window.
  • The area at the lower right has multiple functions in addition to plot displays. Other tabs invoke a file manager, a package manager for R supplements and a Help page for quick access to information on R commands.

I’ll assume you’re using RStudio for the example of this article and following ones.

RStudio screenshot

Figure 1. RStudio screenshot.

This article shows how to use R to fit a three-dimensional polynomial function to noisy data. A practical example is making an accurate interpolation of radiation dose at a point , D(x,y,z), in a GamBet calculation. Here, the dose in any single element is subject to statistic variations that depend on the number of model input particles. The idea is to get an estimate at a test point [x0,y0,z0] by finding the most probable smooth function using values from a large number of nearby elements. The approach is valid when the physics implies that the quantity varies smoothly in space (e.g, the variation of potential in electrostatics). The present calculation is the three-dimensional extension of the interpolation discussed in a previous article.

As with previous examples, we’ll need to create some test data. The measurement point is [XAvg = 23.45, YAvg = 10.58, ZAvg = -13.70] and the ideal variation of dose about the point is

D = 19.45 + 2.52*(x-XAvg) – 0.25*(X-XAvg)^2
– 4.23*(y-YAvg) + 0.36*(y-YAvg)^2
+ 1.75*(z-ZAvg) – 0.41*(z-ZAvg)^2               [1]

The spacing of data points about the measurement point is Dx = 1.00, Dy = 1.25 and   Dz = 0.90. In this case, we’ll use a computer program to record the data in a CSV file. For reference, here is the core of the FORTRAN program:

OPEN(FILE='multiregress.csv',UNIT=30)
WRITE (30,2000)
XAvg = 23.45
YAvg = 10.58
ZAvg = -13.70
Dx = 1.00
Dy = 1.25
Dz = 0.90
! Main data loop
DO Nz = -5,5
  z = ZAvg + REAL(Nz)*Dz
  DO Ny = -5,5
    y = YAvg + REAL(Ny)*Dy
    DO Nx = -5,5
      x = XAvg + REAL(Nx)*Dx
      CALL RANDOM_NUMBER(Zeta)
      Noise = 2.0*(0.5 - Zeta)
      D = 19.45 &
        + 2.52*(x-XAvg) - 0.25*(X-XAvg)**2 &
        - 4.23*(y-YAvg) + 0.36*(y-YAvg)**2 &
        + 1.75*(z-ZAvg) - 0.41*(z-ZAvg)**2 &
        + Noise
      WRITE (30,2100) x,y,z,D
    END DO
  END DO
END DO
CLOSE (30)
2000 FORMAT ('x,y,z,D')
2100 FORMAT (F14.8,',',F14.8,',',F14.8,',',F14.8)

The program creates a header and then writes 1331 data lines in CSV format. Each line contains the position and dose, [x,y,z,D]. Here are the initial entries in multiregress.csv:

x,y,z,D
18.45000076,    4.32999992,  -18.19999886,   22.14448738
19.45000076,    4.32999992,  -18.19999886,   28.94458199
20.45000076,    4.32999992,  -18.19999886,   38.34004974
...

Run R, clear the console window (Control-L) and close any previous script. Copy and paste the following text into the script editor:

rm(list=objects())
setwd("C:/RExamples/MultiRegress")
DataSet = read.csv("multiregress.csv",header=T)
# Calculate the measurement point
XAvg=mean(DataSet$x)
YAvg=mean(DataSet$y)
ZAvg=mean(DataSet$z)
# Reference spatial variables to the measurement point
DataSet$x = DataSet$x - XAvg
DataSet$y = DataSet$y - YAvg
DataSet$z = DataSet$z - ZAvg
# Perform the fitting
FitModel = lm(D~x+y+z+I(x^2)+I(y^2)+I(z^2),DataSet)
summary(FitModel)
# Plot a scan along x through the measurement point
png(filename="XScan.png")
xscan = subset(DataSet,y < 0.001 & y > -0.001 & z < 0.001 & z > -0.001 )
plot(xscan$x,xscan$D)
lines(xscan$x,predict(FitModel,newdata=xscan))
dev.off()

The first three lines perform operations that we have previously discussed: clear all R objects, set a working directory and read information from the CSV file created by the FORTRAN program. The information is stored in data frame DataSet. Use Control-R in the script editor to execute the command lines. Note that DataSet appears in the Environment list of RStudio. Double-click on DataSet to open it in a data-editor tab. Here, you can inspect values and even modify them,

The first task in the analysis is to shift the spatial values so that they are referenced to the measurement point, as in Eq. [1].

XAvg=mean(DataSet$x)
YAvg=mean(DataSet$y)
ZAvg=mean(DataSet$z)
DataSet$x = DataSet$x - XAvg
DataSet$y = DataSet$y - YAvg
DataSet$z = DataSet$z - ZAvg

We use the mean() function under the assumption that the data points are centered about the measurement position. A single command performs the linear fit:

FitModel = lm(D~x+y+z+I(x^2)+I(y^2)+I(z^2),DataSet)

The summary() command gives the following information:

            Estimate   Std. Error  Generating
(Intercept) 19.540509   0.179142    19.45
x            2.560374   0.025733     2.52
y           -4.226011   0.020587    -4.23
z            1.737097   0.028593     1.75
I(x^2)      -0.265439   0.009214    -0.25
I(y^2)       0.358165   0.005897     0.36
I(z^2)      -0.404630   0.011375    -0.41

As with all statistical calculations, a visual indicator is a quick way to gauge the validity of the fit. The following commands create a two-dimensional plot, a scan of data points and the fitting function along x at y = 0.0, z = 0.0 (through the measurement point). The plot is saved as a PNG file in the working directory.

[1] png(filename="XScan.png")
[2] xscan = subset(DataSet,y < 0.001 & y > -0.001 & z < 0.001 & z > -0.001 )
[3] plot(xscan$x,xscan$D)
[4] lines(xscan$x,predict(FitModel,newdata=xscan))
[5] dev.off()

Line [1] opens a file for PNG output, while Line5] flushes the plot information to the file and closes it. Line[2] uses the subset() function to create a data frame, xscan, that contains only data points where y = 0.0 and z = 0.0.  The first argument is the source data frame and the second argument is the criterion for picking included row. There are two noteworthy features in the command expression:

  • The symbol & denotes the logical and operation.
  • The inclusion criterion has some slack. This is to account to floating point round-off errors. If you inspect xscan with the data editor, you will note that are some values on the order of 10[-8] rather than 0.0

Line [3] plots the raw data points while the lines() command in Line 4 applies the method discussed in a previous article to plot the fitting function. Similar operations were applied to create scans through the measurement point along y and z to produce the plots of Fig. 2.

Scans of the data and fitting function

Figure 2. Scans of the data and fitting function along x, y and z passing through the measurement point.

Footnotes

[1] The entire series is available as a PDF book at http://www.fieldp.com/rintroduction.html.

[2] Contact us : techinfo@fieldp.com.

[3] Field Precision home page: www.fieldp.com.

Comments are closed.