Generating arbitrary distributions with R

(This is the tenth and final article in the series Using R for Gambet statistical analysis.[1]) The R package includes functions to create a wide array of distributions of interest in mathematics. Nonetheless, there are many particle distributions of interest in physics (such as differential cross sections) that are not directly represented. This article introduces a method to sample from any probability distribution that can be approximated by a set of points.

Let’s start by creating some data to work with. We’ll model the kinetic energy distribution of positrons emitted in the β decay of Na22. The maximum value is 540 keV. To approximate the probability mass density, we’ll use one half cycle of a sine function skewed toward higher energies to represent Coulomb repulsion from the product nucleus. Copy and paste the following commands to the RStudio script editor:

xmax = 540.0
nmax = 50
MassDens = function(x) {
  Out = sin(x*pi/xmax)*(0.35+0.4*x/xmax)
  return(Out)
}
xval = c(seq(from=0.0,to=xmax,length.out=(nmax+1)))
mdens = numeric(length=(nmax+1))
for (n in 1:(nmax+1)) {
  mdens[n] = MassDens(xval[n])
}

The commands

MassDens = function(x) {
  Out = sin(x*pi/xmax)*(0.35+0.4*x/xmax)
  return(Out)
}

define a function MassDens that follows the curve of Figure 1. The command

xval = c(seq(from=0.0,to=xmax,length.out=(nmax+1)))

creates a vector of 51 points equally spaced along the energy axis from 0.0 to 540.0 keV. The commands

mdens = numeric(length=(nmax+1))
for (n in 1:(nmax+1)) {
  mdens[n] = MassDens(xval[n])
}

create an empty vector to hold the probability mass density p(x) and fill the values with a loop that uses the MassDens function. At this point, the relative probability is sufficient — it is not necessary to worry about normalization.

Positron spectrum

Figure 1. Positron spectrum. The relative probability mass density is represented by 51 points.

To assign energy values for a particle distribution, we need the cumulative probability distribution, defined as

P(x) = ∫ p(x’)dx’

where the integral runs from xmin to x. The function P(x) equals the probability that the energy is less than or equal to x. It has a value of 0.0 at xmin and 1.0 at xmax. The following commands use the trapezoidal rule to perform the integral:

dx = xmax/nmax
cdens = numeric(length=(nmax+1))
cdens[1] = 0.0
for (n in 2:(nmax+1)) {
  cdens[n] = cdens[n-1] + (mdens[n-1]+mdens[n])*dx/2.0
}
cdens = cdens/cdens[51]

Note that the final command normalizes the function. The result is a vector cdens of 51 points to represent the cumulative distribution.

The cumulative probability P(x) is a monotonically increasing function of x, so there is a unique value of x for every value of P(x). Therefore, we can view x as a function of P(x), as illustrated by Fig. 2, created with the command:


We can also make an accurate calculation of x for any P(x) using the spline interpolation functions of R. The line in Fig. 2 was created with the command:
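Neither the plotting command nor the spline command survives in this copy of the article. Here is a plausible reconstruction that rebuilds the vectors and produces a figure like Fig. 2 (the plotting options and the vectorized construction are assumptions, not the original script):

```r
# Rebuild the distribution vectors (definitions as in the article)
xmax = 540.0
nmax = 50
MassDens = function(x) sin(x*pi/xmax)*(0.35 + 0.4*x/xmax)
xval = seq(from=0.0, to=xmax, length.out=nmax+1)
mdens = MassDens(xval)
dx = xmax/nmax
# Trapezoidal-rule cumulative distribution, normalized to 1.0
cdens = c(0.0, cumsum((mdens[-1] + mdens[-(nmax+1)])*dx/2.0))
cdens = cdens/cdens[nmax+1]
# Data points of x versus P(x), with a spline interpolation overlaid
plot(cdens, xval, xlab="P(x)", ylab="x (keV)")
lines(spline(cdens, xval, n=201))
```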

Plot of x versus P(x) (the cumulative distribution function)

Figure 2. Plot of x versus P(x) (the cumulative distribution function) showing data points and a spline interpolation.

Figure 3 illustrates the principle underlying the sampling method. Suppose we want to create 10,000 particles. By the definition of the cumulative probability distribution, a total of 1000 of the particles should be contained within the interval 0.6 ≤ P ≤ 0.7. The graph shows that these particles should be assigned energies in the range 320 keV ≤ x ≤ 360 keV. In other words, if ζ represents a uniform sequence of 10,000 numbers from 0.0 to 1.0, then the desired distribution will result if the energy values are assigned according to

x[n] = InvP(ζ[n])

Energy values associated with intervals of the cumulative distribution function

Figure 3. Energy values associated with intervals of the cumulative distribution function.

The R expression for the assignment operation uses a form of the spline() function that creates values at specified points:

NMax = 10000
zetavals = seq(from=0.0,to=1.0,length.out=NMax)
dpoints = numeric(length=NMax)
ztemp = spline(cdens,xval,xout=zetavals)
dpoints = ztemp$y

In this case, the spline() function returns a list containing both the independent and dependent values. The assigned energies are set equal to the dependent values, dpoints = ztemp$y. The top graph in Figure 4 shows a histogram of the resulting distribution.

Histograms of the energy spectra of 10,000 particles

Figure 4. Histograms of the energy spectra of 10,000 particles.

The routine creates the same distribution each time the script is run. In some circumstances, you may want to add variations so that runs are statistically independent. In this case, the uniform sequence of ζ values may be replaced with a random-uniform distribution:
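The replacement command does not appear in this copy. The natural choice (an assumption, but runif() is the standard R generator) is:

```r
NMax = 10000
# Random-uniform values on [0,1] replace the ordered sequence zetavals
zetavals = runif(NMax)
```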


The lower graph in Fig. 4 shows the result.

In conclusion, the R package has a vast set of available commands and options that could occupy several textbooks. In this tutorial, I’ve tried to cover the fundamental core. My goal has been to clarify the sometimes arcane syntax of R so you have the background to explore additional functions. A compendium of scripts and data files for the examples is available. To make a request, please contact us.


[1] The entire series is available as a PDF book at

[2] Contact us :

[3] Field Precision home page:


Creating GamBet SRC files with R

(This is the ninth article in the series Using R for Gambet statistical analysis.[1]) The focus of this article is to learn how to generate particle distributions with R and how to export them in a form that can be used directly as input to GamBet. A second topic is how to run R scripts directly from the Windows command prompt or from batch files. With this capability, you can write scripts for automatic analyses of output from GamBet and other technical programs.

The parameters of primary particles for Monte Carlo simulations in GamBet are specified in source (SRC) files. Output escape files have the same format, so the output from one GamBet calculation can be used as the input to a subsequent one. For initial GamBet simulations, a source file representing specific distributions in position, velocity and energy is usually prepared. Although the GenDist utility can create several useful distributions, the possibilities with R are greatly expanded.

As a test case, we’ll generate an input electron beam for a 3D calculation with current CurrTotal = 2.5 A and average energy E0 = 20.0 keV. The beam has a circular cross section with a Gaussian distribution in x and y of width Xw = Yw = 1.4 mm. The average position is [X0,Y0,Z0] = [0.0 mm, 0.0 mm, 0.0 mm]. The parallel electrons move in z. There is a Gaussian energy spread of Ew = 500 eV.

To begin, we set up a script that can be tested in the interactive environment of RStudio and then see how to convert it to an autonomous program that can be run from a batch file. The first set of commands clears the workspace, sets a working directory and defines parameters:
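The workspace-clearing and directory commands are missing from the listing below; typical forms are shown here (the tempdir() path is a placeholder, not the original directory):

```r
rm(list = ls())    # clear all objects from the workspace
setwd(tempdir())   # placeholder path; substitute the actual data directory
```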

CurrTotal = 2.5
X0 = 0.0
Y0 = 0.0
Z0 = 0.0
Xw = 1.4
Yw = 1.4
E0 = 20000.0
Ew = 500.0
Ux0 = 0.0
Uy0 = 0.0
Uz0 = 1.0
NPart= 100000

The previous article discussed the SRC file format. The strategy will be to set up a vector of length NPart for each quantity in the file data lines, combine them into a data frame and then use the write.table() command to create the file. These commands create the vector for the Type column:

Type = character(length=NPart)
for (n in 1:NPart) {
  Type[n] = "E"
}

The first command creates a character vector called Type with NPart blank entries. The loop fills the vector with the character E.

The quantities Z, Ux, Uy, and Uz are numerical vectors of length NPart that contain identical values. For this, it is convenient to create and to fill the vectors with the seq() and c() commands:

Z = c(seq(from=Z0,to=Z0,length.out=NPart))
Ux = c(seq(from=Ux0,to=Ux0,length.out=NPart))
Uy = c(seq(from=Uy0,to=Uy0,length.out=NPart))
Uz = c(seq(from=Uz0,to=Uz0,length.out=NPart))

The position vectors X and Y and the Energy vector are created using the rnorm() function. It generates NPart values following a specified normal distribution:

X = rnorm(NPart,mean=X0,sd=Xw)
Y = rnorm(NPart,mean=Y0,sd=Yw)
Energy = rnorm(NPart,mean=E0,sd=Ew)

The set of vectors is assembled into a data frame. Note that the column names are the same as the vector names, Type, Energy, X, …:

SRCRaw = data.frame(Type,Energy,X,Y,Z,Ux,Uy,Uz)

We must take some precautions. With the normal distribution, there is a very small but non-zero probability of extreme values of the position and energy. They could result in electrons outside the GamBet solution volume or negative energy values. Both conditions would lead to a program error. We create a subset of the raw data such that no electron has r > 7.0 mm or Energy ≤ 0.0:

SRCFile = subset(SRCRaw,((X^2+Y^2)<=25.0*Xw^2) & (Energy > 0.0))

We need to add the current per electron to complete the SRCFile data frame. Note the use of NLength, the number of electrons in the modified data frame, which may be less than NPart:

NLength = length(SRCFile$Type)
dCurr = CurrTotal/NLength

A column vector of identical current values is constructed and appended to the SRCFile data frame with the cbind() command:

Curr = c(seq(from=dCurr,to=dCurr,length.out=NLength))
SRCFile = cbind(SRCFile,Curr)

We also define two plotting vectors for later use:

RVector = sqrt(SRCFile$X^2 + SRCFile$Y^2)
EVector = SRCFile$Energy

We’re ready to write the file. Some variables are set up in preparation:

FNameOut = "TestSRCGeneration.SRC"
HLine1 = "* GamBet Particle Escape File (Field Precision)"
HLine2 = paste("* NPart:",NLength)
HLine3 = "* Type   Energy         X            Y            Z             Ux           Uy           Uz       Curr"
HLine4 = "* =========================================================================================================="

Note that the actual number of electrons was written to HLine2. The following commands open the file (over-writing any previous version) and write the header using the cat() command:
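The commands themselves are lost from this copy; a plausible reconstruction with cat() (the exact form is an assumption) writes the first header line with append=FALSE to overwrite any previous file, then appends the rest:

```r
# Abbreviated header for this sketch; the script uses HLine1..HLine4
FNameOut = file.path(tempdir(), "TestSRCGeneration.SRC")  # temp path for the sketch
HLine1 = "* GamBet Particle Escape File (Field Precision)"
HLine2 = "* NPart: 99987"   # in the script: paste("* NPart:",NLength)
cat(HLine1, "\n", file=FNameOut, sep="", append=FALSE)  # over-writes any old file
cat(HLine2, "\n", file=FNameOut, sep="", append=TRUE)   # appends
```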


We could add the table in its current form, but it would be nice to have the fixed-width format illustrated in the previous article. This command adds three initial spaces to the Type column:

SRCFile$Type = paste("   ",SRCFile$Type,sep="")

These commands convert the numbers in the columns to character representations with width 12 in either scientific or standard notation:

SRCFile$Energy = format(SRCFile$Energy,scientific=TRUE,digits=5,width=12)
SRCFile$X = format(SRCFile$X,scientific=TRUE,digits=5,width=12)
SRCFile$Y = format(SRCFile$Y,scientific=TRUE,digits=5,width=12)
SRCFile$Z = format(SRCFile$Z,scientific=TRUE,digits=5,width=12)
SRCFile$Ux = format(SRCFile$Ux,scientific=FALSE,digits=6,width=12)
SRCFile$Uy = format(SRCFile$Uy,scientific=FALSE,digits=6,width=12)
SRCFile$Uz = format(SRCFile$Uz,scientific=FALSE,digits=6,width=12)
SRCFile$Curr = format(SRCFile$Curr,scientific=TRUE,digits=6,width=12)

Note that RVector and EVector were defined when the data entries were still numbers. Finally, this command writes the SRC file:

write.table(SRCFile,file=FNameOut,sep=" ",append=TRUE,col.names=FALSE,quote=FALSE,row.names=FALSE)

Here is a sample of the result:

* GamBet Particle Escape File (Field Precision)
* NPart: 100000
* Type   Energy         X            Y            Z             Ux           Uy           Uz       Curr
* ==========================================================================================================
E   2.0448e+04  -6.8232e-01   8.4354e-01        0e+00            0            0            1      2.5e-05
E   1.9949e+04  -4.8475e-02   1.5114e+00        0e+00            0            0            1      2.5e-05
E   2.1422e+04  -6.9644e-01  -4.9890e-01        0e+00            0            0            1      2.5e-05
E   1.9291e+04  -1.9870e+00   5.5780e-01        0e+00            0            0            1      2.5e-05
E   2.0032e+04  -5.6534e-01  -2.0669e-01        0e+00            0            0            1      2.5e-05
E   2.0752e+04   1.2376e+00   7.0758e-01        0e+00            0            0            1      2.5e-05

The format isn’t precisely the way we would like. For example, despite the setting digits=6, the entries in Ux are 0 rather than 0.00000. This is a quirk of R. The program will not write more significant figures than the defining values, no matter what you tell it. Perhaps there is a solution, but I haven’t found it. GamBet and GenDist will recognize the above form. Figure 1 shows a GenDist scatter plot of model particles in the x-y plane.

Scatter plot in x-y produced by GenDist

Figure 1. Scatter plot in x-y produced from the SRC file by GenDist

The following commands create histograms to display the distribution of electrons in radius and energy (Figure 2):
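The hist() commands themselves are missing here. A plausible form, shown with stand-in vectors in place of the RVector and EVector built from the data frame (titles and options are assumptions):

```r
# Stand-ins for the vectors defined earlier from SRCFile
RVector = sqrt(rnorm(10000, sd=1.4)^2 + rnorm(10000, sd=1.4)^2)  # radii (mm)
EVector = rnorm(10000, mean=20000.0, sd=500.0)                   # energies (eV)
h1 = hist(RVector, breaks=20, density=15, main="Radial distribution",
          xlab="R (mm)", ylab="N/bin")
h2 = hist(EVector, breaks=20, density=15, main="Energy spectrum",
          xlab="T (eV)", ylab="N/bin")
```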

Variation of particle density with radius and the energy spectrum

Figure 2. Variation of particle density with radius and the energy spectrum created with the hist() command.

To conclude, we’ll modify the script so it may be called from a batch file. Suppose we want to make a series of GamBet runs with beams of differing width. For each run, we regenerate the input SRC file with R to represent the new width, run GamBet and rename critical output files so they are not over-written. Here’s a sample of a Windows batch file showing the first operation:

START /WAIT RScript Sect09DemoScript.R "C:\USINGRFORGAMBETSTATISTICALANALYSIS\Examples\Section09\" "1.4"
START /WAIT C:\fieldp_basic\gambet\gambet BatchDemo
RENAME BatchDemo.GLS BatchDemo01.GLS
RENAME BatchDemo.G3D BatchDemo01.G3D

The /WAIT option in the START command ensures that the SRC file is available before starting GamBet and that there will not be a file-access conflict while GamBet is running. Consider the first command. RScript is a special form of R for running scripts. To avoid typing the full path to the program every time, I modified the Windows path to include C:\Program Files\R\R-3.2.0\bin\. The next command-line parameter is the name of the script, followed by two character values that are passed to the script. The first is the working directory for the data and the second is the value of Xw.

Here’s how the script we have been discussing is modified to act as an autonomous program. The first set of commands becomes:

args = commandArgs()
WorkDir = args[6]

The first command stores the command-line parameters in the character vector args; the second extracts the working directory. You may wonder, why start with args[6]? A listing of args gives the following information:

[1] C:\PROGRA~1\R\R-32~1.0\bin\i386\Rterm.exe
[2] --slave
[3] --no-restore
[4] --file=Sect09DemoScript.R
[5] --args
[6] C:\USINGRFORGAMBETSTATISTICALANALYSIS\Examples\Section09\
[7] 1.4

The first argument is the name of the running program, a typical convention in Windows. The next four are the values of options set in RTerm. The values of interest start at the sixth component. The next change is in the initialization section:

Xw = as.numeric(args[7])

Because command line parameters are strings, we force the value to be interpreted as a number to define Xw. Finally, suppose we want to inspect the histograms, even though the program is running in the background. The ending statements are modified to:
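The exact statements are missing from this copy; the standard R pattern (the file name is a placeholder) brackets the plot commands with pdf() and dev.off():

```r
PlotFile = file.path(tempdir(), "Sect09Plots.pdf")  # placeholder name and path
pdf(file = PlotFile)    # route all following plots to the PDF file
hist(rnorm(1000))       # the script's histogram commands go here
dev.off()               # close the device and finish writing the file
```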


The results of all plot commands between pdf() and dev.off() are sent to the specified PDF file, where they can be viewed after the run.

In conclusion, the R package has a vast set of available commands and options that could occupy several textbooks. In this set of short articles, I’ve tried to cover the fundamental core. One of my goals has been to clarify the sometimes arcane syntax of R so you have the background to explore additional functions. A book form of the articles with an index will be available soon on our website.


[1] The entire series is available as a PDF book at

[2] Contact us :

[3] Field Precision home page:

Importing GamBet files into R

(This is the eighth article in the series Using R for Gambet statistical analysis.[1]) In the previous articles, we studied R techniques that would be useful to supplement most technical programs. In this article and the next, we’ll concentrate on integrating R with GamBet. The GamBet program produces two types of text data files [2] that can be analyzed with R:

  • GamBet escape files have a name of the form FName.SRC. They record the parameters of electrons, photons and positrons that escape from the solution volume. An example is the distribution of bremsstrahlung photons produced by an electron beam striking a target. Escape files may serve as the particle source for a subsequent simulation. You can also use R to prepare source files with mathematically-specified distributions, the topic of the next article.
  • Files of the spatial distribution of deposited dose produced by the MATRIX commands in GBView2 and GBView3.

We’ll begin with a discussion of the GamBet SRC file. Here’s an example, the output from a bremsstrahlung target:

* GamBet Particle Escape File (Field Precision)
* Output from run: BremGen
* DUnit:   1.0000E+03
* NPrimary:        1
* NShower:      500
* Type    Energy        X           Y           Z           ux          uy          uz      curr/flux
* ====================================================================================================
E    1.8276E+07  1.0000E+00 -2.2093E-01 -9.5704E-03     0.95867    -0.28441    -0.00726
P    1.0476E+06  1.0000E+00 -2.2099E-01 -9.6599E-03     0.96046    -0.27838     0.00421
P    1.4681E+07  1.0000E+00 -2.2079E-01 -9.4444E-03     0.96751    -0.24988     0.03846
P    1.1157E+05  1.0000E+00 -2.2125E-01 -9.0243E-03     0.93991    -0.33167     0.08099
P    3.7167E+06  1.0000E+00 -1.1189E-01 -6.2985E-02     0.99157    -0.11335    -0.06279
P    8.2996E+05  1.0000E+00  3.3150E-01 -2.6303E-01     0.91975     0.30757    -0.24386

There is a file header consisting of comment lines marked by an asterisk. The header is followed by a large number of data lines. The file terminates with an ENDFILE marker. Each data line contains 8 or 9 entries separated by space delimiters. A data line contains the following components:

  • The marker in the first column gives the type of particle: electron (E or E-), photon (P) or positron (E+). In the example, the output is a mixture of the primary electron beam and the secondary photons.
  • The kinetic energy in eV.
  • The position at the exit, (x,y,z).
  • Components of a unit vector giving the particle direction (ux,uy,uz).

The ninth column is present only in runs where flux weighting is assigned to the model input particles.

If we are going to perform a standard analysis on many files, it would be an advantage to create an R script where the user could choose the working directory and the SRC file interactively. Here is the section of the script to specify and to load a file. It introduces several new concepts and commands. Copy and paste the text to a script file window in RStudio:

library(utils)
if (!exists("WorkDir")) {
  WorkDir = choose.dir(default = "", caption = "Select folder")
  setwd(WorkDir)
}
FDefault = paste(WorkDir,"\\*.src",sep="")
FName = choose.files(default=FDefault,caption="Select GamBet SRC file",multi=FALSE)
CheckFile = read.table(FName,header=FALSE,sep="",comment.char="*",fill=TRUE,nrows=1)
NColumn = ncol(CheckFile)
if(NColumn==8) {
# Standard column names
  cnames = c("Type","Energy","X","Y","Z","ux","uy","uz")
} else {
  cnames = c("Type","Energy","X","Y","Z","ux","uy","uz","Flux")
}
SRCFile = read.table(FName,header=FALSE,sep="",comment.char="*",col.names=cnames,fill=TRUE)
NLength = nrow(SRCFile)
SRCFile = SRCFile[1:(NLength-1),]

The first command

library(utils)

occurs often in work with R. Many default packages are loaded when you start the R console. Although they have been sufficient for our previous work, they represent only a fraction of the available features. In this case, we load the library utils, which supports interactive file operations. The next command lines constitute an if statement:

if (!exists("WorkDir")) {
  WorkDir = choose.dir(default = "", caption = "Select folder")
}

Simple if statements consist of a conditional line followed by any number of commands in braces. In this case, the commands are executed only if the object WorkDir does not exist (! designates the logical not operation). If we had a number of SRC files to analyze in the same directory, we would not want to reset the working directory every time. The choose.dir() operation brings up the standard Windows selection dialog of Fig. 1 and returns the path as WorkDir. The path is then set as the working directory.

Dialog for the choose.dir() operation

Figure 1. Dialog for the choose.dir() operation.

After picking a directory, we’ll use the choose.files() command to pick a file. One of the command parameters is a default file name. We again use the paste operation to concatenate the path name and the default file name:

FDefault = paste(WorkDir,"\\*.src",sep="")

The result looks like this:

FDefault = "C:\\USINGRFORGAMBETSTATISTICALANALYSIS\\Examples\\Section08\\*.src"

In an R string, the double backslash represents a single backslash; R also accepts the forward slash in file paths. The command:

FName = choose.files(default=FDefault,caption="Select GamBet SRC file",multi=FALSE)

opens a standard dialog to return the name of a single file in the working directory of type *.SRC (Figure 2). The example file contains 47,892 entries.

Dialog for the choose.files() operation

Figure 2. Dialog for the choose.files() operation.

The read.table() command provides a simple option for reading the file, but there are two challenges:

  • We don’t know whether there will be 8 or 9 data columns.
  • The line with ENDFILE does not contain any data.

These commands address the first problem:

CheckFile = read.table(FName,header=FALSE,sep="",comment.char="*",fill=TRUE,nrows=1)
NColumn = ncol(CheckFile)
if(NColumn==8) {
  cnames = c("Type","Energy","X","Y","Z","ux","uy","uz")
} else {
  cnames = c("Type","Energy","X","Y","Z","ux","uy","uz","Flux")
}
The read.table() command ignores the header comment lines and reads a single data line (nrows=1) into the dummy data frame CheckFile. The ncol() operator returns the number of data columns as NColumn. Depending on the value, we define a vector cnames of column names containing 8 or 9 components. The next read.table() command inputs the entire set of data lines plus the ENDFILE line:

SRCFile = read.table(FName,header=FALSE,sep="",comment.char="*",col.names=cnames,fill=TRUE)

Note that the number of columns and their names are set by col.names=cnames. In the absence of the fill=TRUE option, the operation would terminate with an error because the ENDFILE line has only one entry. The option specifies that a line with fewer entries than specified should be filled out with NA values. The last line is meaningless, so we delete it with the commands:

NLength = nrow(SRCFile)
SRCFile = SRCFile[1:(NLength-1),]

Clearly, the procedure for loading a GamBet source file has several tricky features. Discovering them takes some trial-and-error. The advantage of a script program like R is that the effort is required only once. The script we have discussed provides a template to load all SRC files.

Now that the file has been loaded as the data frame SRCFile, we can do some calculations. Copy and paste the following information below the load commands:

electrons = subset(SRCFile,Type=="E" | Type=="E-")
photons = subset(SRCFile,Type=="P")
elecavg = mean(electrons$Energy)
photavg = mean(photons$Energy)
hist(electrons$Energy,breaks=20,density=15,main="Electron energy spectrum",xlab="T (eV)",ylab="N/bin")
hist(photons$Energy,breaks=20,density=15,main="Photon energy spectrum",xlab="T (eV)",ylab="N/bin")

This command:

electrons = subset(SRCFile,Type=="E" | Type=="E-")

creates a data frame electrons that contains only rows where the Type is E or E-. We calculate the mean kinetic energy of electrons emerging from the target and create a histogram. Figure 3 shows the result. At this point, you should be able to figure out the meanings of options in the hist() command. Use the Help tab in RStudio and type in hist for more detailed information.

Energy distributions determined from the test SRC file

Figure 3. Energy distributions of primary electrons and bremsstrahlung photons determined from the test SRC file.

To conclude, we’ll briefly discuss importing a matrix file from the two-dimensional GBView2 post-processor. A matrix file is a set of values computed over a regular grid (uniform intervals in x and y or z and r). In this case, the quantity is total dose (deposited energy/mass), dose from primary electrons, dose from primary photons, etc. Here is a sample of the first part of a file:

Matrix of values from data file alumbeam.G2D
XMin:   0.0000E+00    YMin:  -5.0000E-02
XMax:   1.0000E-01    YMax:   5.0000E-02
NX:  20    NY:  20
X           Y           NReg   DoseTotal   DoseElecP   DosePhotP   DosePosiP   DoseElecS   DosePhotS   DosePosiS
0.0000E+00 -5.0000E-02    1  3.3101E+04  2.6251E+04  0.0000E+00  0.0000E+00  6.8504E+03  0.0000E+00  0.0000E+00
5.0000E-03 -5.0000E-02    1  3.5977E+05  3.0285E+05  0.0000E+00  0.0000E+00  5.6925E+04  0.0000E+00  0.0000E+00
1.0000E-02 -5.0000E-02    1  3.5977E+05  3.0285E+05  0.0000E+00  0.0000E+00  5.6925E+04  0.0000E+00  0.0000E+00
1.5000E-02 -5.0000E-02    1  3.9705E+05  3.1705E+05  0.0000E+00  0.0000E+00  8.0007E+04  0.0000E+00  0.0000E+00
2.0000E-02 -5.0000E-02    1  3.9705E+05  3.1705E+05  0.0000E+00  0.0000E+00  8.0007E+04  0.0000E+00  0.0000E+00
2.5000E-02 -5.0000E-02    1  3.8548E+05  3.1449E+05  0.0000E+00  0.0000E+00  7.0986E+04  0.0000E+00  0.0000E+00

The data lines are space delimited and can easily be loaded with the read.table() command. The challenge is the header, where the lines are not comments. Because the header information is not required for the R analysis, we can simply omit it by using the skip option. The following code loads matrix file information, limits data to a scan along x at y = 0.0, carries out a fourth-order polynomial fit and plots the results (Figure 4).

cnames  = c("x","y","NReg","DRate","DoseElecP","DosePhotP","DosePosiP","DoseElecS","DosePhotS","DosePosiS")
MatrixData = read.table(file="Demo.MTX",header=FALSE,sep="",col.names=cnames,skip=6)
AxisPlot = subset(MatrixData,y > -0.00001 & y < 0.00001)
DFit = lm(DRate~I(x)+I(x^2)+I(x^3)+I(x^4),AxisPlot)
PlotSeq = seq(from=0.0,to=0.10,length.out=101)
PlotPos = data.frame(x=PlotSeq)
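The commands that evaluate the fit and draw Figure 4 are not preserved in this copy. A plausible completion, shown here with simulated stand-in data in place of Demo.MTX (the data generator and plot options are assumptions):

```r
# Stand-in for the axis scan read from the matrix file
set.seed(2)
x = seq(from=0.0, to=0.10, length.out=21)
DRate = 4.0e5*(1.0 - (x/0.12)^2) + rnorm(21, sd=1.0e4)
AxisPlot = data.frame(x=x, DRate=DRate)
# Fourth-order polynomial fit, as in the article
DFit = lm(DRate ~ I(x) + I(x^2) + I(x^3) + I(x^4), AxisPlot)
PlotSeq = seq(from=0.0, to=0.10, length.out=101)
PlotPos = data.frame(x=PlotSeq)
# Evaluate the fit on the fine sequence and overlay it on the data
plot(AxisPlot$x, AxisPlot$DRate, xlab="x", ylab="DoseTotal")
lines(PlotSeq, predict(DFit, PlotPos))
```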
Fitted dose distribution determined from a GBView2 matrix file

Figure 4. Fitted dose distribution determined from a GBView2 matrix file.


[1] The entire series is available as a PDF book at

[2] The read.fortran() command of R provides a path to import information from the binary output files of any Field Precision technical program.

[3] Contact us :

[4] Field Precision home page:


Nonlinear fitting and parameter estimation

(This is the seventh article in the series Using R for Gambet statistical analysis.[1]) To this point, we have carried out basic statistical calculations with R. With some effort, we could perform the operations in spreadsheets or our own programs. In this article, we’ll turn to a much more sophisticated calculation that would be difficult to reproduce: nonlinear parameter fitting. Despite the complexity of the underlying calculations, it is simple to set up the R script.

Figure 1 illustrates the test calculation. Given a noisy signal, we want to determine if there is a resonance hidden within it. Such a response usually has the form of a Gaussian function

A exp[-(x-mu)^2/(2.0*sigma^2)]          [1]

We want to determine values of A, mu and sigma (the resonance width) such that the function gives the most probable fit to the data of Fig. 1. We also want to determine confidence levels for the fit. This is clearly a nonlinear problem because the parameters mu and sigma appear nonlinearly in the function.

Resonance in a noisy signal

Figure 1. Test data: resonance in a noisy signal.

The spreadsheet of Fig. 2 is used to generate data over the range 0.0 < x < 20.0. Values of y are determined by Eq. 1 with A = 20.0, mu = 8.58 and sigma = 1.06. The noise level is set as an adjustable parameter. Rather than save the spreadsheet as a CSV file, we copy the data section as shown and paste it into a text file. In this case, the entries on a line are separated by tab characters instead of commas. Therefore, we use a different command form to read the file:

PeakData = read.table("",header=T,sep="")

The header sets the column names of the data frame PeakData to y and x. The option sep=”” means that any white-space characters (including tabs) act as delimiters.

Spreadsheet to generate data

Figure 2. Spreadsheet to generate data pasted into a tab-delimited file.

To run the example, copy and paste this text into the R script editor:

PeakData = read.table("",header=T,sep="")
fit = nls(y ~ I(Amp*exp(-1.0*(Mu-x)^2/(2.0*Sigma^2))),data=PeakData, start=list(Amp=5.0,Mu=5.0,Sigma=5.0))
new = data.frame(x = seq(min(PeakData$x),max(PeakData$x),len=200))
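The remaining script commands (the plotting, summary and confidence-interval statements) are not preserved in this copy. A plausible reconstruction, with simulated stand-in data replacing the pasted spreadsheet file:

```r
# Stand-in for the tab-delimited data file
set.seed(3)
x = seq(from=0.5, to=20.0, by=0.5)
y = 20.0*exp(-(x - 8.58)^2/(2.0*1.06^2)) + rnorm(length(x), sd=0.5)
PeakData = data.frame(x=x, y=y)
# Nonlinear fit, as in the article
fit = nls(y ~ I(Amp*exp(-1.0*(Mu-x)^2/(2.0*Sigma^2))), data=PeakData,
          start=list(Amp=5.0, Mu=5.0, Sigma=5.0))
new = data.frame(x=seq(min(PeakData$x), max(PeakData$x), len=200))
# Plot raw data, overlay the fitted curve, then print diagnostics
plot(PeakData$x, PeakData$y)
lines(new$x, predict(fit, newdata=new))
summary(fit)
confint(fit)
```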

The critical command line for the nonlinear fit is

fit = nls(y ~ I(Amp*exp(-1.0*(Mu-x)^2/(2.0*Sigma^2))),data=PeakData, start=list(Amp=5.0,Mu=5.0,Sigma=5.0))

The first argument of the nls() command is a defining formula:

y ~ I(Amp*exp(-1.0*(Mu-x)^2/(2.0*Sigma^2)))

It implies that the variable y depends on the independent variable x as well as on parameters named Amp, Mu and Sigma. The dependence is declared explicitly and enclosed in the I() function to be safe. The second argument, data=PeakData, defines the input data source. The third argument,

start=list(Amp=5.0,Mu=5.0,Sigma=5.0)

needs some explanation. The iterative calculation involves varying parameter values and checking whether the overall error is reduced. The user must set reasonable initial values for the search. The start option requires an R list, a general set of values. The list() operator combines several objects into a list. In the form shown, the first item in the list has the name Amp and the value 5.0. Be aware that for some choices of starting parameters, the process may not converge or may converge to an alternate solution.

The summary() command produces a console listing like this:

Formula: y ~ I(Amp * exp(-1 * (Mu - x)^2/(2 * Sigma^2)))

Estimate Std. Error t value Pr(>|t|)
Amp   22.45079    1.86245   12.05 1.49e-14 ***
Mu     8.63087    0.08361  103.22  < 2e-16 ***
Sigma -0.87288    0.08361  -10.44 1.02e-12 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.675 on 38 degrees of freedom

Number of iterations to convergence: 13
Achieved convergence tolerance: 5.277e-06

The next three commands plot the raw data and the fitting line following methods that we discussed in previous articles. Finally, there is a new command that gives the confidence interval:

confint(fit)

which produces a console listing like this:

2.5%      97.5%
Amp   18.738008 26.3724502
Mu     8.460237  8.8129274
Sigma -1.071273 -0.7146478

The results have the following interpretation: each row gives the range within which we can have 95% confidence that the parameter value lies. In other words, for the above example we can have 95% confidence that the peak amplitude is between 18.738008 and 26.3724502.

Results with Noise = 5.0

Figure 3. Results with Noise = 5.0.

With an understanding of the mechanics of the calculation, we can check out some results. To begin, a high level of noise in the spreadsheet (Noise = 5.0) leads to the data points and fitting line of Fig. 3. Here is a summary of numerical results:

Noise level: 5.0
      Estimate   Std. Error       2.5%      97.5%    Generating
Amp   22.45079    1.86245    18.738008 26.3724502      20.0
Mu     8.63087    0.08361     8.460237  8.8129274       8.5
Sigma -0.87288    0.08361    -1.071273 -0.7146478      1.06

The negative sign for Sigma is not an issue because the parameter always appears as a square. Because the peak is barely discernible in the noise, there is some error in the estimate of Sigma. Rerunning the example with a reduced noise level (Noise = 0.5) provides a check that the nls() command is operating correctly. Figure 4 shows the data points and fitting line. The results are much closer to the values used to generate the data.

Noise level: 0.5
      Estimate  Std. Error      2.5%     97.5%
Amp   20.02893    0.17535    19.673012 20.386997
Mu     8.57904    0.01084     8.557082  8.600996
Sigma  1.07258    0.01084     1.050487  1.095135
Results with Noise = 0.5

Figure 4. Results with Noise = 0.5.


[1] The entire series is available as a PDF book at

[2] Contact us :

[3] Field Precision home page:

Working with RStudio: multi-dimensional fitting

(This is the sixth article in the series Using R for Gambet statistical analysis.[1]) RStudio is an integrated development environment (IDE) for working with R, similar to the IDEs available for most computer languages. Figure 1 shows a screenshot. There are work areas for the script editor, console and plots. The difference from the standard R interface is that they are combined in a single convenient workspace rather than scattered among floating windows. RStudio is freely available and provides a wealth of features, so it is hard to imagine why anyone would choose not to use it. Here are some of the advantages:

  • RStudio remembers your screen setups, so it is not necessary to resize and reposition floating windows in every session.
  • The script editor (upper left in Fig. 1) features syntax highlighting and an integrated debugger. You can keep multiple tabbed documents active.
  • An additional area at upper right shows all R objects currently defined. Double-clicking a data frame opens it in a data-editor tab in the script window.
  • The area at the lower right has multiple functions in addition to plot displays. Other tabs invoke a file manager, a package manager for R supplements and a Help page for quick access to information on R commands.

I’ll assume you’re using RStudio for the examples in this article and the following ones.

RStudio screenshot

Figure 1. RStudio screenshot.

This article shows how to use R to fit a three-dimensional polynomial function to noisy data. A practical example is making an accurate interpolation of the radiation dose at a point, D(x,y,z), in a GamBet calculation. Here, the dose in any single element is subject to statistical variations that depend on the number of model input particles. The idea is to estimate the value at a test point [x0,y0,z0] by finding the most probable smooth function using values from a large number of nearby elements. The approach is valid when the physics implies that the quantity varies smoothly in space (e.g., the variation of potential in electrostatics). The present calculation is the three-dimensional extension of the interpolation discussed in a previous article.

As with previous examples, we’ll need to create some test data. The measurement point is [XAvg = 23.45, YAvg = 10.58, ZAvg = -13.70] and the ideal variation of dose about the point is

D = 19.45 + 2.52*(x-XAvg) - 0.25*(x-XAvg)^2
  - 4.23*(y-YAvg) + 0.36*(y-YAvg)^2
  + 1.75*(z-ZAvg) - 0.41*(z-ZAvg)^2               [1]

The spacing of data points about the measurement point is Dx = 1.00, Dy = 1.25 and Dz = 0.90. In this case, we’ll use a computer program to record the data in a CSV file. For reference, here is the core of the FORTRAN program:

WRITE (30,2000)
XAvg = 23.45
YAvg = 10.58
ZAvg = -13.70
Dx = 1.00
Dy = 1.25
Dz = 0.90
! Main data loop
DO Nz = -5,5
  z = ZAvg + REAL(Nz)*Dz
  DO Ny = -5,5
    y = YAvg + REAL(Ny)*Dy
    DO Nx = -5,5
      x = XAvg + REAL(Nx)*Dx
      ! Zeta: uniform random deviate in [0,1)
      CALL RANDOM_NUMBER(Zeta)
      Noise = 2.0*(0.5 - Zeta)
      D = 19.45 &
        + 2.52*(x-XAvg) - 0.25*(x-XAvg)**2 &
        - 4.23*(y-YAvg) + 0.36*(y-YAvg)**2 &
        + 1.75*(z-ZAvg) - 0.41*(z-ZAvg)**2 &
        + Noise
      WRITE (30,2100) x,y,z,D
    END DO
  END DO
END DO
CLOSE (30)
2000 FORMAT ('x,y,z,D')
2100 FORMAT (F14.8,',',F14.8,',',F14.8,',',F14.8)

The program creates a header and then writes 1331 data lines in CSV format. Each line contains the position and dose, [x,y,z,D]. Here are the initial entries in multiregress.csv:

18.45000076,    4.32999992,  -18.19999886,   22.14448738
19.45000076,    4.32999992,  -18.19999886,   28.94458199
20.45000076,    4.32999992,  -18.19999886,   38.34004974
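For readers who prefer to stay inside R, the same test file can be generated with a short script. Here is a sketch (the file name and constants are taken from the article; the expand.grid() vectorization replaces the Fortran loops, so the row order differs but the content is equivalent):

```r
# Generate the 1331-line multiregress.csv test file directly in R.
XAvg = 23.45; YAvg = 10.58; ZAvg = -13.70
Dx = 1.00; Dy = 1.25; Dz = 0.90
grid = expand.grid(Nx = -5:5, Ny = -5:5, Nz = -5:5)
x = XAvg + grid$Nx*Dx
y = YAvg + grid$Ny*Dy
z = ZAvg + grid$Nz*Dz
Noise = 2.0*(0.5 - runif(nrow(grid)))   # same form as the Fortran Noise
D = 19.45 + 2.52*(x - XAvg) - 0.25*(x - XAvg)^2 -
    4.23*(y - YAvg) + 0.36*(y - YAvg)^2 +
    1.75*(z - ZAvg) - 0.41*(z - ZAvg)^2 + Noise
write.csv(data.frame(x, y, z, D), "multiregress.csv", row.names = FALSE)
```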

Run R, clear the console window (Control-L) and close any previous script. Copy and paste the following text into the script editor:

rm(list=objects())
setwd("...")   # replace with the directory containing multiregress.csv
DataSet = read.csv("multiregress.csv",header=T)
# Calculate the measurement point
XAvg = mean(DataSet$x)
YAvg = mean(DataSet$y)
ZAvg = mean(DataSet$z)
# Reference spatial variables to the measurement point
DataSet$x = DataSet$x - XAvg
DataSet$y = DataSet$y - YAvg
DataSet$z = DataSet$z - ZAvg
# Perform the fitting
FitModel = lm(D~x+y+z+I(x^2)+I(y^2)+I(z^2),DataSet)
summary(FitModel)
# Plot a scan along x through the measurement point
xscan = subset(DataSet,y < 0.001 & y > -0.001 & z < 0.001 & z > -0.001 )

The first three lines perform operations that we have previously discussed: clear all R objects, set a working directory and read information from the CSV file created by the FORTRAN program. The information is stored in the data frame DataSet. Use Control-R in the script editor to execute the command lines. Note that DataSet appears in the Environment list of RStudio. Double-click DataSet to open it in a data-editor tab. Here, you can inspect values and even modify them.

The first task in the analysis is to shift the spatial values so that they are referenced to the measurement point, as in Eq. [1].

DataSet$x = DataSet$x - XAvg
DataSet$y = DataSet$y - YAvg
DataSet$z = DataSet$z - ZAvg

The coordinates of the measurement point are found with the mean() function, under the assumption that the data points are centered about the measurement position:

XAvg = mean(DataSet$x)
YAvg = mean(DataSet$y)
ZAvg = mean(DataSet$z)

A single command performs the linear fit:

FitModel = lm(D~x+y+z+I(x^2)+I(y^2)+I(z^2),DataSet)

The summary() command gives the following information:

            Estimate   Std. Error  Generating
(Intercept) 19.540509   0.179142    19.45
x            2.560374   0.025733     2.52
y           -4.226011   0.020587    -4.23
z            1.737097   0.028593     1.75
I(x^2)      -0.265439   0.009214    -0.25
I(y^2)       0.358165   0.005897     0.36
I(z^2)      -0.404630   0.011375    -0.41
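Because the spatial variables are referenced to the measurement point, the interpolated dose at [XAvg, YAvg, ZAvg] itself is simply the fit evaluated at x = y = z = 0, i.e., the intercept. Here is a self-contained sketch that regenerates comparable test data in R (rather than reading the Fortran output; the noise amplitude and object names are illustrative):

```r
# Regenerate test data following Eq. [1] (coordinates already referenced
# to the measurement point), fit, and evaluate the dose at the point.
set.seed(4)
g = expand.grid(x = (-5:5)*1.00, y = (-5:5)*1.25, z = (-5:5)*0.90)
g$D = 19.45 + 2.52*g$x - 0.25*g$x^2 - 4.23*g$y + 0.36*g$y^2 +
      1.75*g$z - 0.41*g$z^2 + runif(nrow(g), -1.0, 1.0)
FitModel = lm(D ~ x + y + z + I(x^2) + I(y^2) + I(z^2), g)
D0 = predict(FitModel, newdata = data.frame(x = 0.0, y = 0.0, z = 0.0))
D0   # close to the generating constant 19.45
```

The I() wrapper is required so that the ^ operator is treated as arithmetic squaring rather than formula notation.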

As with all statistical calculations, a visual indicator is a quick way to gauge the validity of the fit. The following commands create a two-dimensional plot, a scan of data points and the fitting function along x at y = 0.0, z = 0.0 (through the measurement point). The plot is saved as a PNG file in the working directory.

[1] png(filename="XScan.png")
[2] xscan = subset(DataSet,y < 0.001 & y > -0.001 & z < 0.001 & z > -0.001 )
[3] plot(xscan$x,xscan$D)
[4] lines(xscan$x,predict(FitModel,newdata=xscan))
[5] dev.off()

Line [1] opens a file for PNG output, while Line [5] flushes the plot information to the file and closes it. Line [2] uses the subset() function to create a data frame, xscan, that contains only data points where y = 0.0 and z = 0.0. The first argument is the source data frame and the second argument is the criterion for selecting the included rows. There are two noteworthy features of the command expression:

  • The symbol & denotes the logical and operation.
  • The inclusion criterion has some slack to account for floating-point round-off errors. If you inspect xscan with the data editor, you will note that some values are on the order of 10^-8 rather than exactly 0.0.

Line [3] plots the raw data points, while the lines() command in Line [4] applies the method discussed in a previous article to plot the fitting function. Similar operations were applied to create scans through the measurement point along y and z to produce the plots of Fig. 2.
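For completeness, the matching y scan can be sketched as follows (the z scan has the same form). In a working session, DataSet and FitModel already exist from the script above; a noise-free stand-in data set is built here so the fragment runs on its own:

```r
# Stand-in data (replace with the DataSet/FitModel already in the session).
DataSet = expand.grid(x = (-5:5)*1.00, y = (-5:5)*1.25, z = (-5:5)*0.90)
DataSet$D = 19.45 + 2.52*DataSet$x - 0.25*DataSet$x^2 -
            4.23*DataSet$y + 0.36*DataSet$y^2 +
            1.75*DataSet$z - 0.41*DataSet$z^2
FitModel = lm(D ~ x + y + z + I(x^2) + I(y^2) + I(z^2), DataSet)
# Scan along y through the measurement point (x = 0, z = 0)
png(filename = "YScan.png")
yscan = subset(DataSet, x < 0.001 & x > -0.001 & z < 0.001 & z > -0.001)
plot(yscan$y, yscan$D)
lines(yscan$y, predict(FitModel, newdata = yscan))
dev.off()
```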

Scans of the data and fitting function

Figure 2. Scans of the data and fitting function along x, y and z passing through the measurement point.


[1] The entire series is available as a PDF book at

[2] Contact us :

[3] Field Precision home page: