Nonlinear fitting and parameter estimation

(This is the seventh article in the series Using R for Gambet statistical analysis.) To this point, we have carried out basic statistical calculations with R. With some effort, we could perform the same operations in spreadsheets or our own programs. In this article, we’ll turn to a much more sophisticated calculation that would be difficult to reproduce: nonlinear parameter fitting. Despite the complexity of the underlying calculations, the R script is simple to set up.

Figure 1 illustrates the test calculation. Given a noisy signal, we want to determine if there is a resonance hidden within it. Such a response usually has the form of a Gaussian function

y = A*exp[-(x-mu)^2/(2.0*sigma^2)]          [1]

We want to determine values of A, mu and sigma (the resonance width) such that the function gives the most probable fit to the data of Fig. 1. We also want to determine confidence levels for the fit. This is clearly a nonlinear problem because the parameters mu and sigma appear inside the exponential rather than as linear multipliers.

Figure 1. Test data: resonance in a noisy signal.

The spreadsheet of Fig. 2 is used to generate data over the range 0.0 < x < 20.0. Values of y are determined by Eq. 1 with A = 20.0, mu = 8.58 and sigma = 1.06. The noise level is set as an adjustable parameter. Rather than save the spreadsheet as a CSV file, we copy the data section as shown and paste it into a text file, peakdata.tab. In this case, the entries on a line are separated by tab characters instead of commas. Therefore, we use a different command form to read the file:

PeakData = read.table("peakdata.tab",header=T,sep="")

The header=T option signals that the first line of the file contains the column names of the data frame PeakData, x and y. The option sep="" means that any white-space characters (including tabs) act as delimiters.

Figure 2. Spreadsheet to generate data pasted into a tab-delimited file.

To run the example, copy and paste this text into the R script editor:

PeakData = read.table("peakdata_high.tab",header=T,sep="")
fit = nls(y ~ I(Amp*exp(-1.0*(Mu-x)^2/(2.0*Sigma^2))),data=PeakData, start=list(Amp=5.0,Mu=5.0,Sigma=5.0))
summary(fit)
plot(PeakData$x,PeakData$y)
new = data.frame(x = seq(min(PeakData$x),max(PeakData$x),len=200))
lines(new$x,predict(fit,newdata=new))
confint(fit,level=0.95)

The critical command line for the nonlinear fit is

fit = nls(y ~ I(Amp*exp(-1.0*(Mu-x)^2/(2.0*Sigma^2))),data=PeakData, start=list(Amp=5.0,Mu=5.0,Sigma=5.0))

The first argument of the nls() command is a defining formula:

y ~ I(Amp*exp(-1.0*(Mu-x)^2/(2.0*Sigma^2)))

It implies that the variable y depends on the independent variable x as well as parameters named Amp, Mu and Sigma. The dependence is declared explicitly and enclosed in the I() function to be safe. The second argument, data=PeakData, defines the input data source. The third argument,

start=list(Amp=5.0,Mu=5.0,Sigma=5.0)

needs some explanation. The iterative calculation involves varying the parameter values and checking whether the overall error is reduced. The user must set reasonable initial values for the search. The start option requires an R list, a general collection of named values. The list() operator combines several objects into a list. In the form shown, the first item in the list has name Amp and value 5.0. Be aware that for some choices of parameters, the process may not converge or may converge to an alternate solution, as in the sketch below.
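
Because convergence depends on the starting values, it can be useful to guard the fit. Here is a minimal sketch (my addition, not part of the original article; the grid values are illustrative assumptions) that wraps nls() in try() and cycles through a coarse grid of starting points until one succeeds:

# Retry the fit over a grid of starting values
StartGrid = expand.grid(Amp=c(5.0,20.0),Mu=c(5.0,10.0),Sigma=c(1.0,5.0))
fit = NULL
for (i in seq_len(nrow(StartGrid))) {
  trial = try(nls(y ~ I(Amp*exp(-1.0*(Mu-x)^2/(2.0*Sigma^2))),
                  data=PeakData,start=as.list(StartGrid[i,])),silent=TRUE)
  if (!inherits(trial,"try-error")) { fit = trial; break }  # keep first success
}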

The summary() command produces a console listing like this:

Formula: y ~ I(Amp * exp(-1 * (Mu - x)^2/(2 * Sigma^2)))

Parameters:
Estimate Std. Error t value Pr(>|t|)
Amp   22.45079    1.86245   12.05 1.49e-14 ***
Mu     8.63087    0.08361  103.22  < 2e-16 ***
Sigma -0.87288    0.08361  -10.44 1.02e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.675 on 38 degrees of freedom

Number of iterations to convergence: 13
Achieved convergence tolerance: 5.277e-06

The next three commands plot the raw data and the fitting line following methods that we discussed in previous articles. Finally, there is a new command that gives the confidence interval:

confint(fit,level=0.95)

which produces a console listing like this:

2.5%      97.5%
Amp   18.738008 26.3724502
Mu     8.460237  8.8129274
Sigma -1.071273 -0.7146478

The results have the following interpretation: each row gives the range within which we can have 95% confidence that the parameter value lies. In other words, for the above example we can have 95% confidence that the peak amplitude is between 18.738008 and 26.3724502.

Results with Noise = 5.0

Figure 3. Results with Noise = 5.0.

With an understanding of the mechanics of the calculation, we can check out some results. To begin, a high level of noise in the spreadsheet (Noise = 5.0) leads to the data points and fitting line of Fig. 3. Here is a summary of numerical results:

Noise level: 5.0
      Estimate   Std. Error       2.5%      97.5%    Generating
Amp   22.45079    1.86245    18.738008 26.3724502      20.0
Mu     8.63087    0.08361     8.460237  8.8129274      8.58
Sigma -0.87288    0.08361    -1.071273 -0.7146478      1.06

The negative sign for Sigma is not an issue because the parameter always appears as a square. Because the peak is barely discernible in the noise, there is some error in the estimate of Sigma. Rerunning the example with a reduced noise level (Noise = 0.5) provides a check that the nls() command is operating correctly. Figure 4 shows the data points and fitting line. The results are much closer to the generating values.
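
If a strictly positive width is preferred, one option (a suggestion on my part, not used in the article) is the "port" algorithm of nls(), which accepts box constraints on the parameters:

# Constrained fit: lower bounds keep Amp and Sigma non-negative
fit2 = nls(y ~ I(Amp*exp(-1.0*(Mu-x)^2/(2.0*Sigma^2))),data=PeakData,
           start=list(Amp=5.0,Mu=5.0,Sigma=5.0),
           algorithm="port",lower=c(Amp=0.0,Mu=0.0,Sigma=0.01))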

Noise level: 0.5
      Estimate  Std. Error      2.5%     97.5%
Amp   20.02893    0.17535    19.673012 20.386997
Mu     8.57904    0.01084     8.557082  8.600996
Sigma  1.07258    0.01084     1.050487  1.095135

Figure 4. Results with Noise = 0.5.


Working with RStudio: multi-dimensional fitting

(This is the sixth article in the series Using R for Gambet statistical analysis.) RStudio is an integrated development environment (IDE) for working with R, similar to the IDEs available for most computer languages. Figure 1 shows a screenshot. There are work areas for the script editor, console and plots. The difference is that they are combined in a single convenient workspace rather than floating windows. RStudio is freely available and provides a wealth of features, so it is hard to imagine why anyone would choose not to use it. Here are some of the advantages:

  • RStudio remembers your screen setups, so it is not necessary to resize and reposition floating windows in every session.
  • The script editor (upper left in Fig. 1) features syntax highlighting and an integrated debugger. You can keep multiple tabbed documents active.
  • An additional area at upper-right shows all R objects currently defined. Double-clicking a data frame opens it in a data-editor tab in the script window.
  • The area at the lower right has multiple functions in addition to plot displays. Other tabs invoke a file manager, a package manager for R supplements and a Help page for quick access to information on R commands.

I’ll assume you’re using RStudio for the example of this article and following ones.

Figure 1. RStudio screenshot.

This article shows how to use R to fit a three-dimensional polynomial function to noisy data. A practical example is making an accurate interpolation of radiation dose at a point, D(x,y,z), in a GamBet calculation. Here, the dose in any single element is subject to statistical variations that depend on the number of model input particles. The idea is to get an estimate at a test point [x0,y0,z0] by finding the most probable smooth function using values from a large number of nearby elements. The approach is valid when the physics implies that the quantity varies smoothly in space (e.g., the variation of potential in electrostatics). The present calculation is the three-dimensional extension of the interpolation discussed in a previous article.

As with previous examples, we’ll need to create some test data. The measurement point is [XAvg = 23.45, YAvg = 10.58, ZAvg = -13.70] and the ideal variation of dose about the point is

D = 19.45 + 2.52*(x-XAvg) - 0.25*(x-XAvg)^2
  - 4.23*(y-YAvg) + 0.36*(y-YAvg)^2
  + 1.75*(z-ZAvg) - 0.41*(z-ZAvg)^2               [1]

The spacing of data points about the measurement point is Dx = 1.00, Dy = 1.25 and Dz = 0.90. In this case, we’ll use a computer program to record the data in a CSV file. For reference, here is the core of the FORTRAN program:

OPEN(FILE='multiregress.csv',UNIT=30)
WRITE (30,2000)
XAvg = 23.45
YAvg = 10.58
ZAvg = -13.70
Dx = 1.00
Dy = 1.25
Dz = 0.90
! Main data loop
DO Nz = -5,5
  z = ZAvg + REAL(Nz)*Dz
  DO Ny = -5,5
    y = YAvg + REAL(Ny)*Dy
    DO Nx = -5,5
      x = XAvg + REAL(Nx)*Dx
      CALL RANDOM_NUMBER(Zeta)
      Noise = 2.0*(0.5 - Zeta)
      D = 19.45 &
        + 2.52*(x-XAvg) - 0.25*(x-XAvg)**2 &
        - 4.23*(y-YAvg) + 0.36*(y-YAvg)**2 &
        + 1.75*(z-ZAvg) - 0.41*(z-ZAvg)**2 &
        + Noise
      WRITE (30,2100) x,y,z,D
    END DO
  END DO
END DO
CLOSE (30)
2000 FORMAT ('x,y,z,D')
2100 FORMAT (F14.8,',',F14.8,',',F14.8,',',F14.8)

The program creates a header and then writes 1331 data lines in CSV format. Each line contains the position and dose, [x,y,z,D]. Here are the initial entries in multiregress.csv:

x,y,z,D
18.45000076,    4.32999992,  -18.19999886,   22.14448738
19.45000076,    4.32999992,  -18.19999886,   28.94458199
20.45000076,    4.32999992,  -18.19999886,   38.34004974
...
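
If you would rather stay entirely within R, here is a rough equivalent of the FORTRAN generator (my sketch, using the same constants; the noise is uniform on [-1, 1] as in the RANDOM_NUMBER version):

# Build the 11 x 11 x 11 lattice of sample points and write the CSV file
XAvg = 23.45; YAvg = 10.58; ZAvg = -13.70
Grid = expand.grid(Nx=-5:5,Ny=-5:5,Nz=-5:5)   # 1331 points
x = XAvg + Grid$Nx*1.00
y = YAvg + Grid$Ny*1.25
z = ZAvg + Grid$Nz*0.90
D = 19.45 + 2.52*(x-XAvg) - 0.25*(x-XAvg)^2 -
    4.23*(y-YAvg) + 0.36*(y-YAvg)^2 +
    1.75*(z-ZAvg) - 0.41*(z-ZAvg)^2 + runif(length(x),-1.0,1.0)
write.csv(data.frame(x,y,z,D),"multiregress.csv",row.names=FALSE)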

Run R, clear the console window (Control-L) and close any previous script. Copy and paste the following text into the script editor:

rm(list=objects())
setwd("C:/RExamples/MultiRegress")
DataSet = read.csv("multiregress.csv",header=T)
# Calculate the measurement point
XAvg=mean(DataSet$x)
YAvg=mean(DataSet$y)
ZAvg=mean(DataSet$z)
# Reference spatial variables to the measurement point
DataSet$x = DataSet$x - XAvg
DataSet$y = DataSet$y - YAvg
DataSet$z = DataSet$z - ZAvg
# Perform the fitting
FitModel = lm(D~x+y+z+I(x^2)+I(y^2)+I(z^2),DataSet)
summary(FitModel)
# Plot a scan along x through the measurement point
png(filename="XScan.png")
xscan = subset(DataSet,y < 0.001 & y > -0.001 & z < 0.001 & z > -0.001 )
plot(xscan$x,xscan$D)
lines(xscan$x,predict(FitModel,newdata=xscan))
dev.off()

The first three lines perform operations that we have previously discussed: clear all R objects, set a working directory and read information from the CSV file created by the FORTRAN program. The information is stored in the data frame DataSet. Use Control-R in the script editor to execute the command lines. Note that DataSet appears in the Environment list of RStudio. Double-click on DataSet to open it in a data-editor tab. Here, you can inspect values and even modify them.

The first task in the analysis is to shift the spatial values so that they are referenced to the measurement point, as in Eq. [1].

XAvg=mean(DataSet$x)
YAvg=mean(DataSet$y)
ZAvg=mean(DataSet$z)
DataSet$x = DataSet$x - XAvg
DataSet$y = DataSet$y - YAvg
DataSet$z = DataSet$z - ZAvg

We use the mean() function under the assumption that the data points are centered about the measurement position. A single command performs the linear fit:

FitModel = lm(D~x+y+z+I(x^2)+I(y^2)+I(z^2),DataSet)

The summary() command gives the following information:

            Estimate   Std. Error  Generating
(Intercept) 19.540509   0.179142    19.45
x            2.560374   0.025733     2.52
y           -4.226011   0.020587    -4.23
z            1.737097   0.028593     1.75
I(x^2)      -0.265439   0.009214    -0.25
I(y^2)       0.358165   0.005897     0.36
I(z^2)      -0.404630   0.011375    -0.41
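
The fitted model can also supply the smoothed dose estimate at the measurement point itself. Because the coordinates were shifted, that point corresponds to x = y = z = 0, so the estimate is simply the intercept. A one-row data frame makes this explicit (a quick check, not part of the article's script):

# Predict the dose at the shifted origin (the measurement point)
Point = data.frame(x=0.0,y=0.0,z=0.0)
predict(FitModel,newdata=Point)   # returns the intercept, about 19.54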

As with all statistical calculations, a visual indicator is a quick way to gauge the validity of the fit. The following commands create a two-dimensional plot, a scan of data points and the fitting function along x at y = 0.0, z = 0.0 (through the measurement point). The plot is saved as a PNG file in the working directory.

[1] png(filename="XScan.png")
[2] xscan = subset(DataSet,y < 0.001 & y > -0.001 & z < 0.001 & z > -0.001 )
[3] plot(xscan$x,xscan$D)
[4] lines(xscan$x,predict(FitModel,newdata=xscan))
[5] dev.off()

Line [1] opens a file for PNG output, while Line [5] flushes the plot information to the file and closes it. Line [2] uses the subset() function to create a data frame, xscan, that contains only data points where y = 0.0 and z = 0.0. The first argument is the source data frame and the second argument is the criterion for picking the included rows. There are two noteworthy features in the command expression:

  • The symbol & denotes the logical and operation.
  • The inclusion criterion has some slack to account for floating-point round-off errors. If you inspect xscan with the data editor, you will note that there are some values on the order of 10^-8 rather than exactly 0.0.

Line [3] plots the raw data points, while the lines() command in Line [4] applies the method discussed in a previous article to plot the fitting function. Similar operations create the scans through the measurement point along y and z shown in Fig. 2; a sketch of the y scan follows.
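
For reference, here is a sketch of the y scan (the z scan is analogous; these lines are my reconstruction rather than the article's):

# Scan along y through the measurement point (x = 0, z = 0)
png(filename="YScan.png")
yscan = subset(DataSet,x < 0.001 & x > -0.001 & z < 0.001 & z > -0.001)
plot(yscan$y,yscan$D)
lines(yscan$y,predict(FitModel,newdata=yscan))
dev.off()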

Figure 2. Scans of the data and fitting function along x, y and z passing through the measurement point.


Linear curve fitting and plotting, part B

(This is the fifth article in the series Using R for Gambet statistical analysis.) We’ll continue the examination of linear fitting and plots with another example, a statistical Fourier analysis. It illustrates that 1) the lm() command can deal with any set of fitting functions and 2) data frame components need not always be called y and x. Here is the script to copy and paste into the R script window:

# Function to generate test data
TrigDemo = function(z) {
  Out = 1.4+2.30*sin(z)+0.75*cos(z)-1.80*sin(2*z)+2.15*cos(2*z)+0.65*sin(3*z)-0.25*cos(3*z)
  return(Out)
}
# Set up some variables
zmin = 0.0
zmax = 5.0
Noise = 0.55
# Plot the function without noise
curve(TrigDemo(z),zmin,zmax,xname="z")
# Create a data frame of input data
zvector = seq(from=zmin,to=zmax,length.out=51)
phivector = TrigDemo(zvector)
TestData = data.frame(z=zvector,phi=phivector)
# Add some noise
TestData$phi = TestData$phi + rnorm(length(TestData$phi),0,Noise)
# Plot the raw data
plot(TestData$z,TestData$phi)
# Fit the data and list the results
FitTrig = lm(phi~I(sin(z))+I(cos(z))+I(sin(2*z))+I(cos(2*z))+I(sin(3*z))+I(cos(3*z)),TestData)
summary(FitTrig)
# Add the fitted line to the plot
PlotSeq = seq(from=0.0,to=5.0,length.out=51)
PlotPos = data.frame(z=PlotSeq)
lines(PlotPos$z,predict(FitTrig,newdata=PlotPos))

We’ll generate data by adding statistical noise to the test function

phi = 1.4 + 2.30*sin(z) + 0.75*cos(z) - 1.80*sin(2*z) + 2.15*cos(2*z) + 0.65*sin(3*z) - 0.25*cos(3*z)     [1]

over the interval 0.0 < z < 5.0. Equation [1] is a lengthy expression, and it would be inconvenient to type it into command arguments. For efficiency, we’ll define our own function to supplement the built-in functions of R. Here are the corresponding lines:

TrigDemo = function(z) {
  Out = 1.4+2.30*sin(z)+0.75*cos(z)-1.80*sin(2*z)+2.15*cos(2*z)+0.65*sin(3*z)-0.25*cos(3*z)
  return(Out)
}

The top definition line contains the name of the function, the keyword function and the names of the argument (or arguments) that will be used in the calculation. Any number of command lines for calculations with the argument may be included between the braces. R functions have flexible output: there is a single output value if the input argument is a scalar, while a vector argument gives a vector output. In other circumstances, the output may even be the functional dependence itself. The type of output depends on the context. To demonstrate the third case, we’ll use the TrigDemo function to create a plot of the variation of Eq. [1] using the curve() command:

curve(TrigDemo(z),zmin,zmax,xname="z")

The first argument is a definition of the mathematical expression to plot, such as "18.67 + z^2"; the next two arguments are numerical values that give the range of the independent variable, and the last argument defines the name of the independent variable. Figure 1 shows the resulting plot.
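
You can confirm the scalar and vector behavior directly in the console (a quick check, not part of the article's script):

TrigDemo(0.0)               # scalar argument, single value returned
TrigDemo(c(0.0,2.5,5.0))    # vector argument, vector returned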

Figure 1. Plot of the generating curve for the Fourier example.

The following script command lines demonstrate a case where the function produces vector output:

[1] zvector = seq(from=zmin,to=zmax,length.out=51)
[2] phivector = TrigDemo(zvector)
[3] TestData = data.frame(z=zvector,phi=phivector)

As we saw in the previous article, the seq() operator generates a vector zvector that contains a set of z values over the desired range with interval δz = 0.1. We apply the function TrigDemo with argument zvector to generate phivector, a set of corresponding dependent values. The vectors are combined into a data frame with column names z and phi. Finally, we want to add some statistical noise:

TestData$phi = TestData$phi + rnorm(length(TestData$phi),0,Noise)

The function rnorm() generates random numbers in a normal distribution. R includes a related set of random-number functions for different distributions (runif(), rpois(), ...). The output is a vector. The arguments of rnorm() are 1) the length of the output vector, 2) the mean of the distribution and 3) the standard deviation. The points in Fig. 2 show the resulting noisy representation of the generating function.
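
A side note: rnorm() draws fresh random numbers on every run. If you want a reproducible noise sequence, seed the generator first (an addition on my part; the seed value is arbitrary):

set.seed(42)   # any fixed integer gives repeatable noise
TestData$phi = TestData$phi + rnorm(length(TestData$phi),0,Noise)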

Figure 2. Points show the noisy input function, and the line is the most probable fit with the assumed fitting function.

This form of the lm() command performs the fit:

FitTrig = lm(phi~I(sin(z))+I(cos(z))+I(sin(2*z))+I(cos(2*z))+I(sin(3*z))+I(cos(3*z)),TestData)
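
A quick way to judge the fit is to tabulate the estimated coefficients beside the generating values (the Generating vector below is transcribed from Eq. [1]; this check is my addition, not part of the article's script):

# Compare fitted coefficients with the generating values, in formula order
Generating = c(1.4,2.30,0.75,-1.80,2.15,0.65,-0.25)
cbind(Estimate=coef(FitTrig),Generating)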

Finally, this set of commands plot the best-fit line in Fig. 2 following the method discussed in the previous article:

PlotSeq = seq(from=0.0,to=5.0,length.out=51)
PlotPos = data.frame(z=PlotSeq)
lines(PlotPos$z,predict(FitTrig,newdata=PlotPos))

Note how the column names of TestData, z and phi, are used consistently throughout all the commands.


Linear curve fitting and plotting, part A

(This is the fourth article in the series Using R for Gambet statistical analysis.) In this article and the next, we’ll discuss linear curve fitting (or parameter estimation). Here, the term linear does not mean we are restricted to linear fitting functions, but rather that the parameters we seek are linear multipliers of the functions. To illustrate the distinction, suppose we have a set of measurements of a dependent quantity yi(xi). We hypothesize that the measurements can be approximated by a variation of the form:

y = a1*f1(x) + a2*f2(x) + ...   [1]

where f1(x), f2(x), … are known functions of any type. The task is to find the set of parameters a1, a2, … such that Eq. [1] has the highest probability of matching the measured points. Given that Eq. [1] represents a linear fit, what is a nonlinear fit? To illustrate, suppose we wanted to find a resonance response buried in a noisy signal. In this case, the proposed function would be

y = a1*exp(-(x-a2)^2/a3^2)

We seek three parameters, only one of which is a linear multiplier of the function. The other two appear inside the function. While the linear problem is deterministic, nonlinear calculations require iterative methods that may or may not converge. A following article discusses nonlinear fitting in R.

This article and the next one cover univariate linear fitting, y = f(x). In a following article, we’ll advance to multi-dimensional fitting, z = g(x,y),…. To start, let’s revisit the example from the previous article. The points in Figure 1 show the raw data in the file csvdata.csv. The points follow a curve with two inflection points, so we suspect that a third order polynomial would be a good guess:

y = a + b*x + c*x^2 + d*x^3

This function satisfies the definition of a linear fit (Eq. [1]).

Figure 1. Test data with a third-order polynomial fit.

Copy and paste the following text into the script window of R:

setwd("C:/RExcamples/LinFitStudies")
TestData = read.csv("csvdata.csv",header=TRUE)
FitPoly = lm(y~I(x)+I(x^2)+I(x^3),TestData)
summary(FitPoly)
CheckPoints = data.frame(x = c(0.0,5.0,10.0,15.0,20.0))
Pred = predict(FitPoly,newdata=CheckPoints)
Pred
plot(TestData$x,TestData$y,type="p")
PlotSeq = seq(from=0.0,to=20.0,length.out=201)
PlotPos = data.frame(x=PlotSeq)
lines(PlotPos$x,predict(FitPoly,newdata=PlotPos))

As before, we set a working directory. It contains a copy of the test data prepared for the previous article. Save the script as LinFit.R. Again we read the file contents to a data frame named TestData and apply the lm() command to produce the fitting object FitPoly:

FitPoly = lm(y~I(x)+I(x^2)+I(x^3),TestData)

The fitting function specification is more complex than the first order function of the previous article:

y~I(x)+I(x^2)+I(x^3)

It implies that values in the column named y of TestData depend on the values in the column named x. The functional dependencies involve powers of x up to x^3. Note the appearance of the R command I(). It signals that the symbols inside the argument should be interpreted as arithmetic operations rather than function specifications. There is some overlap between the two. For example, in a function specification the + symbol does not designate addition, but rather signals that R should include another functional dependency[1], as the example below shows.
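
To see why I() matters here, compare the two forms below. Without I(), the ^ symbol acts as a formula operator and x^2 collapses to x, so only a straight line is fitted (a console experiment, not part of the article's script):

lm(y ~ x + x^2,TestData)      # ^ is a formula operator here: same as lm(y ~ x,TestData)
lm(y ~ x + I(x^2),TestData)   # I() forces arithmetic: fits y = a + b*x + c*x^2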

The command summary(FitPoly) produces the following information:

               Estimate    Std. Error  Generating function
(Intercept)   22.278672    0.672675       22.67
I(x)           4.877190    0.293093        4.71
I(x^2)        -0.314646    0.034174       -0.30
I(x^3)         0.010301    0.001123        0.01

The coefficients of the original generating function have been included in the last column.

There are two more tasks to complete this example:

  • Check the interpolated value returned by the fitting function for comparison to the linear interpolation of the previous article.
  • Superimpose a plot of the predicted curve on the original data for a visual validity check.

The following command lines accomplish the first task:

[1] CheckPoints = data.frame(x = c(0.0,5.0,10.0,15.0,20.0))
[2] Pred = predict(FitPoly,newdata=CheckPoints)
[3] Pred

Line [1] uses the data.frame() and c() commands that we already discussed. The c() command combines several position values (including x = 15.0) in a vector. The data.frame() command incorporates the vector in a data frame named CheckPoints with one column named x. Line [2] contains the command predict() which requires two arguments: 1) an R fitting object and 2) a data frame with a named column that corresponds to the independent variable of the fitting object. The predict() command returns a value of y for each x value. The final line produces the following listing of information in the console window:

1        2        3        4        5
22.27867 40.08612 49.88714 59.40764 76.37351

The fourth value corresponds to x = 15.0.

The following lines create the plot of Fig. 1:

[1] plot(TestData$x,TestData$y,type="p")
[2] PlotSeq = seq(from=0.0,to=20.0,length.out=201)
[3] PlotPos = data.frame(x=PlotSeq)
[4] lines(PlotPos$x,predict(FitPoly,newdata=PlotPos))

We’re already familiar with the plot() command of Line [1] — it opens the plot and creates the set of points. The next three lines provide a useful template for plotting any fitted line:

  • The seq() command creates a vector PlotSeq of uniformly-spaced position values over the plot range 0.0 ≤ x ≤ 20.0. The vector contains 201 values with interval δx = 0.1.
  • Line [3] incorporates PlotSeq into the data frame PlotPos. The single column has the name x.
  • The lines() command adds a number of connected lines to an existing plot. The arguments are two vectors of equal length giving values along the abscissa and ordinate. We use PlotPos$x as the first vector and again use the predict() command to supply the second vector.

The result is the line in Fig. 1.

Footnotes:

[1] In the previous article, we used a function specification y~x without the I() command because the expression did not include any common symbols. To be safe, it is best to use I() for all functional expressions.


Creating and importing CSV files

(This is the third article in the series Using R for Gambet statistical analysis.) Data from experiments and computer outputs may include thousands of items, so it is clearly impractical to enter values into R by typing them. Generally, data are recorded in files and we use special R commands to read them. R recognizes some binary formats like Excel, but the most reliable approach is to record the data in a text file. Properly formatted text files act as a universal data medium that can be created or read by almost any program. There are no issues of software version changes and, most important, there are no mysteries. You can check the files with any text editor.

The most common format for storing data in a text file is comma-separated-variable, or CSV. In a general sense, CSV could apply to any file that uses commas as a delimiter. We’ll use a more restrictive definition. A CSV file contains a number of data lines (rows). A line consists of text terminated by a carriage return and/or linefeed character. Each line is divided into the same number of text entries (columns) separated by commas. The text entries may represent strings or numbers. This organization is reminiscent of spreadsheet data or R data frames — it is not surprising that both types of programs can read and write directly to CSV files.

In this article we’ll use a spreadsheet to create data for a statistical analysis, save the results as a CSV file, read the file into R and perform several useful calculations. Figure 1 shows a spreadsheet setup to generate points in the range 0.0 ≤ x ≤ 20.0 from the formula

y = 22.67 + 4.71*x - 0.3*x^2 + 0.01*x^3     [1]

Random noise uniformly distributed over the range -2.5 ≤ δy ≤ 2.5 is added for the statistical analyses in R. Note that a header line containing the strings “x” and “y” has been included. The choice of names will be important in R — we will make consistent references to them when defining R operations.

Figure 1. Spreadsheet to generate the test data.

To create a CSV file from OpenOffice Calc, simply use SaveAs and choose the format option Text CSV. Here is a section of the resulting file:

x,y
0,24.8324297602
0.25,21.8520734699
0.5,24.207368359
...

There are two columns of data and 81 rows (not counting the header). Most programs that read and write CSV files (including Calc and R) have flexible input parsers that recognize strings and numbers in any valid format, including scientific notation.

Here is the R script that we will be using:

setwd("C:/RExamples/CSVStudies")
TestData = read.csv("csvdata.csv",header=TRUE)
plot(TestData$x,TestData$y)
Check15 = subset(TestData,x >= 12.5 & x <= 17.5)
Check15$x = Check15$x - 15.0
Fit15 = lm(y~x,Check15)
summary(Fit15)
plot(Check15$x,Check15$y)
abline(coef(Fit15))
write.csv(Check15,file="points15.csv",row.names=FALSE)

Run R and choose File/New script. Copy and paste the lines above into the script editor window. (It’s a good idea to save the script file for later reference.) We’ll step through the lines to learn some useful R operations. Set the cursor in the script editor to the first line and press Control-R to advance through each step.

It’s always a good idea to organize input/output files in a single directory. This command sets the working directory so you don’t have to enter full paths for file operations.

setwd("C:/RExamples/CSVStudies")

The following command (the core of this article) loads the CSV data file into a data frame named TestData with 2 columns and 81 rows.

TestData = read.csv("csvdata.csv",header=TRUE)

The columns have the names “x” and “y“. The option header=TRUE signals to R that the first line of the file contains the column names. If the file has no header, there are options for the read.csv command to set the names explicitly. Go to the console window and type TestData to confirm that the information was loaded.
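
For instance, a headerless version of the file could be read with explicit column names via the col.names option (the file name here is hypothetical):

TestData = read.csv("csvdata_nohead.csv",header=FALSE,col.names=c("x","y"))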

The next command

plot(TestData$x,TestData$y)

creates the plot of the input data shown as the top graph in Fig. 2. It is a noisy version of the curve of Eq. 1.

Input and interpolation data

Figure 2. Top: input data, red lines show the range used for the interpolation. Bottom: the interpolation data and the fitting line.

In a following article, we’ll see how to get an estimate of all the polynomial coefficients in Eq. 1. For now, we’ll concentrate on a local interpolation to estimate the value and slope at a single measurement point, x = 15. For this, we’ll include only data points near the measurement point. This command,

Check15 = subset(TestData,x >= 12.5 & x <= 17.5)

produces a data frame Check15 that is a subset of TestData. The selection criterion is that values in the column named “x” must be greater than or equal to 12.5 and less than or equal to 17.5. The range is shown by dashed red lines in Fig. 2. Check15 contains two columns named “x” and “y” with 21 rows. Again, type Check15 in the console to review the content.

We’ll use a routine that finds the slope and intercept of the most probable straight line for the data. The intercept equals the desired interpolation if the measurement point corresponds to x = 0.0. This command shifts all values in the “x” column of the new data frame by -15.0:

Check15$x = Check15$x - 15.0

We’re ready to do the fit, using the powerful and versatile lm command:

Fit15 = lm(y~x,Check15)

The command has two arguments: 1) a specification of the fitting function and 2) the data frame to be used. The function specification y~x is a subtle concept and merits a detailed explanation. An important point is that it is not the fitting equation, but rather an expression of dependencies in the fit.

  • The ~ symbol designates a functional relationship rather than an equality.
  • The y to the left of ~ specifies that values in the column named “y” correspond to the dependent variable.
  • The x to the right of ~ specifies that values in the column named “x” correspond to the independent variable.
  • The appearance of the single x to the right of ~ indicates that the fit should assume a linear relationship, y = a + bx.

We’ll have a chance to discuss more complex formula specifications in following articles.

The lm operation creates Fit15, a special object. It is neither a vector nor a data frame, but rather a set of fitting coefficients and tolerances in a format specific to R. There are special operations in R to display or to extract information from fitting objects. For example, the operation summary(Fit15) produces the following display:

Call:
lm(formula = y ~ x, data = Check15)
Residuals:
Min      1Q  Median      3Q     Max
-2.3474 -2.1410  0.1371  1.1435  2.8666
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  59.7433     0.4041 147.851  < 2e-16 ***
x             2.5978     0.2669   9.732 8.13e-09 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.852 on 19 degrees of freedom
Multiple R-squared:  0.8329,    Adjusted R-squared:  0.8241
F-statistic: 94.72 on 1 and 19 DF,  p-value: 8.133e-09

This is probably more information than you’d want to know. The important results are that 1) the interpolated value at the point is y = 59.7433 ± 0.4041 and 2) the slope is dy/dx = 2.5978 ± 0.2669. The predictions from Eq. 1 are y = 59.57 and dy/dx = 4.71 - 0.6*15 + 0.03*15^2 = 2.4600.

Another function to extract information from a fit is coef(). For a linear fit, the routine returns two numbers, the intercept and slope. It so happens that there is a plotting command in R, abline(), that adds a straight line to an existing plot, and its arguments are the intercept and slope of the line. We can therefore understand the following command set:

plot(Check15$x,Check15$y)
abline(coef(Fit15))

The plot command displays y versus x over the range of x in Check15. The abline command adds the straight line determined by the fitting operation. The lower plot in Fig. 2 shows the result.
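
The fit results can also be captured for later calculations. coef() returns a named vector, so individual values are available by name (a short illustration, not part of the article's script):

Fit15Coefs = coef(Fit15)
Fit15Coefs["(Intercept)"]   # interpolated y at the measurement point
Fit15Coefs["x"]             # local slope dy/dx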

Finally, we can create a CSV file with R:

write.csv(Check15,file="points15.csv",row.names=FALSE)

The row.names=FALSE option indicates that row names should be excluded. Here is the initial section of the file points15.csv:

x,y
-2.5,56.0587661858
-2.25,54.0067949137
-2,54.5795616217
-1.75,57.4048454528
-1.5,56.623110328
...

If we didn’t include the option, the output of the write.csv() function would look like this:

"","x","y"
"51",-2.5,56.0587661858
"52",-2.25,54.0067949137
"53",-2,54.5795616217
"54",-1.75,57.4048454528
"55",-1.5,56.623110328
...
