Quantifying the World

Case Study 2 - Real-Time Location System

Stacy Conant

January 21, 2020


Abstract

The proliferation of WiFi has enabled the use of real-time location systems (RTLS) inside buildings for tracking critical personnel and valuable equipment. The following case study extends Nolan and Lang's k-nearest neighbors (kNN) analysis of data from an RTLS at the University of Mannheim. kNN will be used to predict the location of a device within one floor of a campus building. Four kNN models that include either six or seven access point MAC addresses will be compared using two classification techniques, unweighted and weighted. Though the error estimates did not vary dramatically, the weighted kNN with seven access point MAC addresses attained the lowest estimated error.


Introduction

The use of Global Positioning Systems (GPS) has spread vastly over the past ten years and become a common part of most people's everyday experience. Unfortunately, GPS does not work well inside buildings. Instead, Indoor Positioning Systems (IPS) are used for locating people and objects indoors. With WiFi, an IPS can implement a real-time location system (RTLS) that keeps track of items and people in real time. These systems can be used for indoor tracking of critical personnel or equipment, such as doctors, nurses, and medical material in a hospital. They also serve many practical purposes such as inventory management, shipping management, and locating equipment in a warehouse.

For this case study, modeling and analysis of data from an RTLS in a building at the University of Mannheim will be conducted. Signal strengths between a hand-held scanning device and 7 stationary WiFi access points were recorded within one floor of the building. Measurements were taken 110 times at each of 166 locations in 8 orientations at 45 degree increments (0, 45, 90, etc.), making the dataset contain over one million data points.
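As a quick check of scale, the counts above multiply out to the number of measurement events in the offline file; since each event records signal readings from several MAC addresses, the total number of data points exceeds one million. This matches the line count found when parsing the offline file in the appendix.

```r
# 166 locations x 8 orientations x 110 repetitions per combination
events = 166 * 8 * 110
events   # 146,080 measurement events, each with multiple signal readings
```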

This study will be an extension of the work previously done on this dataset by Nolan and Lang. Their aim was to build a k-nearest neighbors (kNN) model for the location of a device as a function of the signal strength between the device and each access point. They then used their model to predict the location of a new unknown device based on the detected signals for the device. In their analysis, Nolan and Lang used only six access points of the seven total. They decided to match access points to their locations, keeping the access point with MAC address 00:0f:a3:39:e1:c0 and removing the data corresponding to MAC address 00:0f:a3:39:dd:cd. This study will add to their work by conducting kNN analysis on the omitted MAC, 00:0f:a3:39:dd:cd, and also fitting the model with both MAC addresses.

The task will be to process the data and prepare it for analysis, as Nolan and Lang did, then use kNN to determine locations and to ascertain which of the two MAC addresses, or both, is most appropriate for inclusion in the RTLS for the building in question. Additionally, this study will implement an alternative prediction method. While kNN proved a good approach to determining location for Nolan and Lang, there are simple alternatives such as a weighted model. This technique weights each neighbor's contribution inversely proportional to its distance from the test observation, so nearer points contribute more to the kNN location calculation than points that are farther away. The models will be compared by using a function called calcError() that compares estimates against actuals for a measure of the size of the error.
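For reference, a minimal sketch of what such an error function could look like, assuming predicted and actual positions are stored as two-column (x, y) matrices; the calcError() used in the appendix comes from the starter code and may differ in detail:

```r
# Sum of squared distances between estimated and actual (x, y)
# positions; one row per test observation.
calcError = function(estXY, actualXY)
  sum(rowSums((estXY - actualXY)^2))
```

For example, a single prediction off by 3 meters in x and 4 meters in y contributes 3^2 + 4^2 = 25 to the total.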


Figure 1: The Floor Plan - Fixed access points are denoted by black square markers, the offline/training data was collected at the locations marked by grey dots, and the online measurements were recorded at randomly selected points indicated with black dots. The grey dots are spaced one meter apart.


Data

For the analysis, we will utilize two different datasets. The "offline" dataset will be used as the training data to develop the prediction model. The "online" dataset is used as the test set. Much of the data cleaning process is unraveling the original format of the data and extracting the variables to create a more useful dataset. It also involves converting variables to the appropriate data types for analysis and scaling variables, such as time, correctly. All the records for adhoc measurements are dropped from the data frame, which removes more than 100,000 rows from the dataset. Additionally, the scanMac, posZ, channel, and type variables are removed because they contain only one value or are not relevant. The orientation values were rounded to correspond to eight angle increments, and the 12 MAC addresses were sifted for the seven applicable addresses.

Data preparation code and descriptions appear in the appendix.

Both datasets share the variable units described below:

Variable Units

Variable Name | Definition
t | Timestamp in milliseconds since midnight, January 1, 1970 UTC
id | MAC address of the scanning device
pos | The physical coordinates of the scanning device
degree | The orientation of the user carrying the scanning device, in degrees (45 degree increments)
MAC | MAC address of a responding peer (e.g., an access point or a device in adhoc mode), with the corresponding values for signal strength in dBm (decibel-milliwatts), the channel frequency, and its mode (access point = 3, device in adhoc mode = 1)


Methodology and Analysis

Just as Nolan and Lang did, this analysis will use kNN with averaged X-Y coordinates for the first part of this study. The kNN algorithm can be used for classification as well as regression tasks. It has become a widely used technique due to its simplicity, ease of interpretation, and low calculation time. There are several distance measures that can be used for kNN, but for this study the Euclidean distance will be used, where $S_i$ is the signal strength measured between the hand-held device and the $i$-th access point for a training observation taken at a specified location, and $S^*_i$ is the signal measured between the same access point and our new point whose $(x, y)$ values are what is being predicted:

$\sqrt{(S_1^*-S_1)^2 + \cdots + (S_6^*-S_6)^2}$
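To make the distance concrete, the following sketch (hypothetical helper names, not the exact functions used in the appendix) computes this signal-space distance and the plain kNN average of the k nearest training positions:

```r
# Euclidean distance in signal space between one new signal vector
# and every row of a training signal matrix.
signalDist = function(trainSignals, newSignal)
  sqrt(rowSums(sweep(trainSignals, 2, newSignal)^2))

# Unweighted kNN estimate: average the (x, y) positions of the
# k training observations closest in signal space.
predXY = function(trainSignals, trainXY, newSignal, k = 3) {
  nearest = order(signalDist(trainSignals, newSignal))[1:k]
  colMeans(trainXY[nearest, , drop = FALSE])
}
```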

The last model in this study will utilize weighted X-Y coordinates instead of averaged X-Y points. A weighted kNN is designed to give more weight to the data points that are close by and less weight to the points that are farther away. Instead of averaging, this modified kNN applies weights to the X-Y positions of each of the k closest points and takes the sum of those results. This could increase the accuracy of the model. The weighted kNN again utilizes the Euclidean distance, where the weights are

$\frac{1/d_i}{\sum_{i=1}^{k} 1/d_i}$,

for the $i$-th closest neighboring observation where $d_i$ is the distance from the new point to the closest reference point in terms of signal strength.
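Applying these weights amounts to replacing the average with a weighted sum, as in this sketch (a hypothetical helper in the same style as above; zero distances would need special handling):

```r
# Weighted kNN estimate: weight each of the k nearest training
# positions by 1/d_i, normalized so the weights sum to one.
predXYweighted = function(dists, trainXY, k = 3) {
  nearest = order(dists)[1:k]
  w = (1 / dists[nearest]) / sum(1 / dists[nearest])
  # column-wise recycling scales row i of the k x 2 matrix by w[i]
  colSums(trainXY[nearest, , drop = FALSE] * w)
}
```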

The results from the four models are below. The models are compared by using a function called calcError() that compares estimates against actuals for a measure of the size of the error.


Figure 2: SSE for k values 1 to 20 with MAC 00:0f:a3:39:e1:c0

kNN with MAC 00:0f:a3:39:e1:c0

The graph above and the analysis indicate that for the model with six MACs, including 00:0f:a3:39:e1:c0, $k = 6$ produces the least error. The SSE is 1198 and the calcError() function gives an error estimate of 272.4092.


Figure 3: SSE for k values 1 to 20 with MAC 00:0f:a3:39:dd:cd

kNN with MAC 00:0f:a3:39:dd:cd

The graph above and the analysis indicate that for the model with six MACs, including 00:0f:a3:39:dd:cd, $k = 7$ produces the least error. The SSE is 1003 and the calcError() function gives an error estimate of 258.075.


Figure 4: SSE for k values 1 to 20 with both MACs

kNN with both MAC addresses

The graph above and the analysis indicate that for the model with all seven MACs, including both 00:0f:a3:39:e1:c0 and 00:0f:a3:39:dd:cd, $k = 8$ produces the least error. The SSE is 1459 and the calcError() function gives an error estimate of 235.196.

Because the model with both MACs gave the lowest error (235.196), both MACs were used in the weighted model as well.


Figure 5: SSE for k values 1 to 20, weighted, with both MACs

Weighted kNN with both MAC addresses

The graph above and the analysis indicate that for the weighted model with all seven MACs, including both 00:0f:a3:39:e1:c0 and 00:0f:a3:39:dd:cd, $k = 9$ produces the least error. The SSE is 1396 and the calcError() function gives an error estimate of 227.4423.


Conclusion

The model that resulted in the lowest calculated error was the weighted model that includes all seven access points. Models with only six of the access points had higher estimated error than either model with all seven, weighted and unweighted.

kNN Model | $k$ | Estimated Error
MAC 00:0f:a3:39:e1:c0 | 6 | 272.4092
MAC 00:0f:a3:39:dd:cd | 7 | 258.0750
Both MACs | 8 | 235.1960
Weighted with both MACs | 9 | 227.4423

By mirroring Nolan and Lang's approach of preparing, exploring, and reshaping the datasets, this study was able to successfully implement their kNN methodology for location detection, as well as test additional models for comparison. Based on the results, the best course of action for the RTLS in the University of Mannheim building would be to use a weighted model to achieve the most accurate results. In the future, the analysis and comparison of additional distance measures, such as Manhattan, could yield even better results. As the use of IPS and RTLS proliferates, studies such as these will become more important for efficient and accurate position location.
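For instance, switching to the Manhattan (L1) distance would only change the distance computation in signal space, along the lines of this sketch (an illustration, not code from the present analysis):

```r
# Manhattan (L1) distance in signal space: sum of absolute
# signal-strength differences, rather than the Euclidean root
# of squared differences.
signalDistL1 = function(trainSignals, newSignal)
  rowSums(abs(sweep(trainSignals, 2, newSignal)))
```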

Appendix


Processing and Cleaning the Data:

Much of the code is from Nolan and Lang and/or the starter code from QTW.


In [3]:
# load necessary libraries

#install.packages('fields', repos='http://cran.us.r-project.org')
#install.packages('tictoc', repos='http://cran.us.r-project.org')
library(codetools)
library(tictoc)
library(ggplot2)
library(lattice)
library(fields)
In [4]:
options(digits = 2)
# read in the entire file into a variable txt
# each line will be its own element
set.seed(15)
txt = readLines("offline.final.trace.txt")

#find all lines that begin with the '#' symbol
sum(substr(txt, 1, 1) == "#")

#total length of file (txt)
length(txt)
5312
151392
In [5]:
#Taking a look at the data
txt[1]
txt[2]
txt[3]
txt[4]
txt[5]
txt[6]
txt[7]
txt[8]
txt[9]
txt[10]
'# timestamp=2006-02-11 08:31:58'
'# usec=250'
'# minReadings=110'
't=1139643118358;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;00:14:bf:b1:97:8a=-38,2437000000,3;00:14:bf:b1:97:90=-56,2427000000,3;00:0f:a3:39:e1:c0=-53,2462000000,3;00:14:bf:b1:97:8d=-65,2442000000,3;00:14:bf:b1:97:81=-65,2422000000,3;00:14:bf:3b:c7:c6=-66,2432000000,3;00:0f:a3:39:dd:cd=-75,2412000000,3;00:0f:a3:39:e0:4b=-78,2462000000,3;00:0f:a3:39:e2:10=-87,2437000000,3;02:64:fb:68:52:e6=-88,2447000000,1;02:00:42:55:31:00=-84,2457000000,1'
't=1139643118744;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;00:14:bf:b1:97:8a=-38,2437000000,3;00:0f:a3:39:e1:c0=-54,2462000000,3;00:14:bf:b1:97:90=-56,2427000000,3;00:14:bf:3b:c7:c6=-67,2432000000,3;00:14:bf:b1:97:81=-66,2422000000,3;00:14:bf:b1:97:8d=-70,2442000000,3;00:0f:a3:39:e0:4b=-79,2462000000,3;00:0f:a3:39:dd:cd=-73,2412000000,3;00:0f:a3:39:e2:10=-83,2437000000,3;02:00:42:55:31:00=-85,2457000000,1'
't=1139643119002;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;00:14:bf:b1:97:8a=-38,2437000000,3;00:0f:a3:39:e1:c0=-54,2462000000,3;00:14:bf:b1:97:90=-57,2427000000,3;00:14:bf:b1:97:81=-66,2422000000,3;00:14:bf:3b:c7:c6=-69,2432000000,3;00:14:bf:b1:97:8d=-70,2442000000,3;00:0f:a3:39:e0:4b=-78,2462000000,3;00:0f:a3:39:e2:10=-83,2437000000,3;00:0f:a3:39:dd:cd=-65,2412000000,3;02:64:fb:68:52:e6=-90,2447000000,1'
't=1139643119263;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;00:14:bf:b1:97:8a=-38,2437000000,3;00:14:bf:b1:97:90=-52,2427000000,3;00:0f:a3:39:e1:c0=-54,2462000000,3;00:14:bf:b1:97:81=-64,2422000000,3;00:14:bf:3b:c7:c6=-68,2432000000,3;00:14:bf:b1:97:8d=-74,2442000000,3;00:0f:a3:39:dd:cd=-78,2412000000,3;00:0f:a3:39:e0:4b=-78,2462000000,3;00:0f:a3:39:e2:10=-83,2437000000,3;02:00:42:55:31:00=-84,2457000000,1;02:64:fb:68:52:e6=-87,2447000000,1'
't=1139643119538;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;00:14:bf:b1:97:8a=-46,2437000000,3;00:0f:a3:39:e1:c0=-55,2462000000,3;00:14:bf:b1:97:90=-57,2427000000,3;00:14:bf:3b:c7:c6=-67,2432000000,3;00:0f:a3:39:dd:cd=-66,2412000000,3;00:0f:a3:39:e0:4b=-80,2462000000,3;00:0f:a3:39:e2:10=-83,2437000000,3;00:14:bf:b1:97:81=-66,2422000000,3;02:00:42:55:31:00=-87,2457000000,1'
't=1139643119818;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;00:14:bf:b1:97:8a=-37,2437000000,3;00:0f:a3:39:e1:c0=-54,2462000000,3;00:14:bf:b1:97:81=-65,2422000000,3;00:14:bf:b1:97:8d=-67,2442000000,3;00:14:bf:3b:c7:c6=-67,2432000000,3;00:0f:a3:39:e0:4b=-79,2462000000,3;00:0f:a3:39:dd:cd=-67,2412000000,3;00:0f:a3:39:e2:10=-89,2437000000,3;02:00:42:55:31:00=-86,2457000000,1;02:64:fb:68:52:e6=-90,2447000000,1'
't=1139643120075;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;00:14:bf:b1:97:8a=-38,2437000000,3;00:0f:a3:39:e1:c0=-54,2462000000,3;00:14:bf:b1:97:90=-56,2427000000,3;00:14:bf:b1:97:8d=-72,2442000000,3;00:0f:a3:39:dd:cd=-73,2412000000,3;00:0f:a3:39:e0:4b=-78,2462000000,3;00:14:bf:b1:97:81=-65,2422000000,3;00:0f:a3:39:e2:10=-89,2437000000,3;02:64:fb:68:52:e6=-89,2447000000,1'
In [6]:
# split up line 4 by ";" - the [[1]] is because strsplit returns a list (of length 1) containing the tokens
print(strsplit(txt[4], ";")[[1]])
# split line 4 by multiple break symbols
tokens = strsplit(txt[4], "[;=,]")[[1]]
# list the first 10 tokens
print(tokens[1:10])
 [1] "t=1139643118358"                    "id=00:02:2D:21:0F:33"              
 [3] "pos=0.0,0.0,0.0"                    "degree=0.0"                        
 [5] "00:14:bf:b1:97:8a=-38,2437000000,3" "00:14:bf:b1:97:90=-56,2427000000,3"
 [7] "00:0f:a3:39:e1:c0=-53,2462000000,3" "00:14:bf:b1:97:8d=-65,2442000000,3"
 [9] "00:14:bf:b1:97:81=-65,2422000000,3" "00:14:bf:3b:c7:c6=-66,2432000000,3"
[11] "00:0f:a3:39:dd:cd=-75,2412000000,3" "00:0f:a3:39:e0:4b=-78,2462000000,3"
[13] "00:0f:a3:39:e2:10=-87,2437000000,3" "02:64:fb:68:52:e6=-88,2447000000,1"
[15] "02:00:42:55:31:00=-84,2457000000,1"
 [1] "t"                 "1139643118358"     "id"               
 [4] "00:02:2D:21:0F:33" "pos"               "0.0"              
 [7] "0.0"               "0.0"               "degree"           
[10] "0.0"              
In [7]:
# take a look at tokens in specific columns
tokens[c(2, 4, 6:8, 10)]
# take a look at all the columns except first 10
tokens[ - ( 1:10 ) ]
  1. '1139643118358'
  2. '00:02:2D:21:0F:33'
  3. '0.0'
  4. '0.0'
  5. '0.0'
  6. '0.0'
  1. '00:14:bf:b1:97:8a'
  2. '-38'
  3. '2437000000'
  4. '3'
  5. '00:14:bf:b1:97:90'
  6. '-56'
  7. '2427000000'
  8. '3'
  9. '00:0f:a3:39:e1:c0'
  10. '-53'
  11. '2462000000'
  12. '3'
  13. '00:14:bf:b1:97:8d'
  14. '-65'
  15. '2442000000'
  16. '3'
  17. '00:14:bf:b1:97:81'
  18. '-65'
  19. '2422000000'
  20. '3'
  21. '00:14:bf:3b:c7:c6'
  22. '-66'
  23. '2432000000'
  24. '3'
  25. '00:0f:a3:39:dd:cd'
  26. '-75'
  27. '2412000000'
  28. '3'
  29. '00:0f:a3:39:e0:4b'
  30. '-78'
  31. '2462000000'
  32. '3'
  33. '00:0f:a3:39:e2:10'
  34. '-87'
  35. '2437000000'
  36. '3'
  37. '02:64:fb:68:52:e6'
  38. '-88'
  39. '2447000000'
  40. '1'
  41. '02:00:42:55:31:00'
  42. '-84'
  43. '2457000000'
  44. '1'
In [8]:
# create a matrix of the data in the 'end' columns (not the first 10)
tmp = matrix(tokens[ - (1:10) ], ncol = 4, byrow = TRUE)
# now create a pseudo-pivot table using the first 10 columns as repeat labels
# and the last columns as individual data points
# basically take 1 row and turn it into an 11 x 10 matrix.
mat = cbind(matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp),
                   ncol = 6, byrow = TRUE), 
            tmp)

dim(mat)
  1. 11
  2. 10
In [9]:
mat
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:b1:97:8a -38 2437000000 3
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:b1:97:90 -56 2427000000 3
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:0f:a3:39:e1:c0 -53 2462000000 3
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:b1:97:8d -65 2442000000 3
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:b1:97:81 -65 2422000000 3
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:3b:c7:c6 -66 2432000000 3
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:0f:a3:39:dd:cd -75 2412000000 3
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:0f:a3:39:e0:4b -78 2462000000 3
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:0f:a3:39:e2:10 -87 2437000000 3
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 02:64:fb:68:52:e6 -88 2447000000 1
1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 02:00:42:55:31:00 -84 2457000000 1
In [10]:
# first stab at creating a function to parse lines.  Note that this causes warnings a few cells down
processLine =
function(x)
{
  tokens = strsplit(x, "[;=,]")[[1]]
  tmp = matrix(tokens[ - (1:10) ], ncol = 4, byrow = TRUE)
  cbind(matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp),
               ncol = 6, byrow = TRUE), tmp)
}
In [11]:
#doing some spot checking on our function
tmp = lapply(txt[4:20], processLine)
sapply(tmp, nrow)
offline = as.data.frame(do.call("rbind", tmp))
dim(offline)
  1. 11
  2. 10
  3. 10
  4. 11
  5. 9
  6. 10
  7. 9
  8. 9
  9. 10
  10. 11
  11. 11
  12. 9
  13. 9
  14. 9
  15. 8
  16. 10
  17. 14
  1. 170
  2. 10
In [12]:
# more spot checking -- look, we get warnings; let's deal with them
lines = txt[ substr(txt, 1, 1) != "#" ]
tmp = lapply(lines, processLine)
Warning message in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
"data length exceeds size of matrix"Warning message in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
"data length exceeds size of matrix"Warning message in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
"data length exceeds size of matrix"Warning message in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
"data length exceeds size of matrix"Warning message in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
"data length exceeds size of matrix"Warning message in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
"data length exceeds size of matrix"
In [13]:
# put it all together to process a line as a function
# note that the if statement handles null values to remove warnings
processLine = function(x)
{
  tokens = strsplit(x, "[;=,]")[[1]]
  if (length(tokens) == 10)
    return(NULL)
  tmp = matrix(tokens[ - (1:10) ], ncol = 4, byrow = TRUE)
  cbind(matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6,
               byrow = TRUE), tmp)
}
In [14]:
# do the same thing for offline
options(error = recover, warn = 1)
tmp = lapply(lines, processLine)
offline = as.data.frame(do.call("rbind", tmp), 
                       stringsAsFactors = FALSE)

dim(offline)
  1. 1181628
  2. 10
In [15]:
length(lines)
146080
In [16]:
# Label our data
# Because we stack the data, the names have to be entered twice
names(offline) = c("time", "scanMac", "posX", "posY", "posZ", 
                   "orientation", "mac", "signal", 
                   "channel", "type")
# notice the repeated names
numVars = c("time", "posX", "posY", "posZ", 
            "orientation", "signal")
In [17]:
# convert the numeric variables
offline[ numVars ] =  lapply(offline[ numVars ], as.numeric)

offline = offline[ offline$type == "3", ]
offline = offline[ , "type" != names(offline) ]
dim(offline)
  1. 978443
  2. 9
In [18]:
head(offline)
time scanMac posX posY posZ orientation mac signal channel
1.1e+12 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:b1:97:8a -38 2437000000
1.1e+12 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:b1:97:90 -56 2427000000
1.1e+12 00:02:2D:21:0F:33 0 0 0 0 00:0f:a3:39:e1:c0 -53 2462000000
1.1e+12 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:b1:97:8d -65 2442000000
1.1e+12 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:b1:97:81 -65 2422000000
1.1e+12 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:3b:c7:c6 -66 2432000000

Creating and scaling time correctly

In [19]:
offline$rawTime = offline$time
offline$time = offline$time/1000
class(offline$time) = c("POSIXt", "POSIXct")

unlist(lapply(offline, class))
time1
'POSIXt'
time2
'POSIXct'
scanMac
'character'
posX
'numeric'
posY
'numeric'
posZ
'numeric'
orientation
'numeric'
mac
'character'
signal
'numeric'
channel
'character'
rawTime
'numeric'
In [20]:
head(offline)
time scanMac posX posY posZ orientation mac signal channel rawTime
2006-02-11 08:31:58 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:b1:97:8a -38 2437000000 1.1e+12
2006-02-11 08:31:58 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:b1:97:90 -56 2427000000 1.1e+12
2006-02-11 08:31:58 00:02:2D:21:0F:33 0 0 0 0 00:0f:a3:39:e1:c0 -53 2462000000 1.1e+12
2006-02-11 08:31:58 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:b1:97:8d -65 2442000000 1.1e+12
2006-02-11 08:31:58 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:b1:97:81 -65 2422000000 1.1e+12
2006-02-11 08:31:58 00:02:2D:21:0F:33 0 0 0 0 00:14:bf:3b:c7:c6 -66 2432000000 1.1e+12
In [21]:
# do a summary exam of data 
summary(offline[, numVars])
      time                          posX         posY           posZ  
 Min.   :2006-02-11 08:31:58   Min.   : 0   Min.   : 0.0   Min.   :0  
 1st Qu.:2006-02-11 14:21:27   1st Qu.: 2   1st Qu.: 3.0   1st Qu.:0  
 Median :2006-02-11 20:57:58   Median :12   Median : 6.0   Median :0  
 Mean   :2006-02-16 15:57:37   Mean   :14   Mean   : 5.9   Mean   :0  
 3rd Qu.:2006-02-19 15:52:40   3rd Qu.:23   3rd Qu.: 8.0   3rd Qu.:0  
 Max.   :2006-03-09 21:41:10   Max.   :33   Max.   :13.0   Max.   :0  
  orientation      signal   
 Min.   :  0   Min.   :-99  
 1st Qu.: 90   1st Qu.:-69  
 Median :180   Median :-60  
 Mean   :167   Mean   :-62  
 3rd Qu.:270   3rd Qu.:-53  
 Max.   :360   Max.   :-25  
In [22]:
# summary of labels
summary(sapply(offline[ , c("mac", "channel", "scanMac")],
                as.factor))
                mac               channel                    scanMac      
 00:0f:a3:39:e1:c0:145862   2462000000:189774   00:02:2D:21:0F:33:978443  
 00:0f:a3:39:dd:cd:145619   2437000000:152124                             
 00:14:bf:b1:97:8a:132962   2412000000:145619                             
 00:14:bf:3b:c7:c6:126529   2432000000:126529                             
 00:14:bf:b1:97:90:122315   2427000000:122315                             
 00:14:bf:b1:97:8d:121325   2442000000:121325                             
 (Other)          :183831   (Other)   :120757                             

Eliminating variables scanMac and posZ because they each have only one value.

In [23]:
# eliminate unused variables
offline = offline[ , !(names(offline) %in% c("scanMac", "posZ"))]
In [24]:
head(offline)
time posX posY orientation mac signal channel rawTime
2006-02-11 08:31:58 0 0 0 00:14:bf:b1:97:8a -38 2437000000 1.1e+12
2006-02-11 08:31:58 0 0 0 00:14:bf:b1:97:90 -56 2427000000 1.1e+12
2006-02-11 08:31:58 0 0 0 00:0f:a3:39:e1:c0 -53 2462000000 1.1e+12
2006-02-11 08:31:58 0 0 0 00:14:bf:b1:97:8d -65 2442000000 1.1e+12
2006-02-11 08:31:58 0 0 0 00:14:bf:b1:97:81 -65 2422000000 1.1e+12
2006-02-11 08:31:58 0 0 0 00:14:bf:3b:c7:c6 -66 2432000000 1.1e+12

Exploring Orientation

The orientation, or angle to the sensor at which the measurement was taken, also needed some cleaning to group the angles into the appropriate 8 orientations. This was accomplished by rounding the angles to the closest of the 45 degree increments.

In [25]:
# get unique orientations
length(unique(offline$orientation))
203
In [26]:
# Create a plot of the orientation
# not my personal choice but it works
plot(ecdf(offline$orientation), pch = 19, cex = 0.3,
     xlim = c(-5, 365), axes = FALSE,
     xlab = "orientation", ylab = "Empirical CDF", main = "")
box()
axis(2)
axis(side = 1, at = seq(0, 360, by = 45))

The graph above is the empirical distribution function of orientation, showing that there are 8 orientations spaced 45 degrees apart.

In [27]:
# create a function that will round off to the nearest major angle
roundOrientation = function(angles) {
  refs = seq(0, by = 45, length  = 9)
  q = sapply(angles, function(o) which.min(abs(o - refs)))
  c(refs[1:8], 0)[q]
}

# we built it, now apply it
offline$angle = roundOrientation(offline$orientation)
In [28]:
# plot of rounded angles
with(offline, boxplot(orientation ~ angle,
xlab = "nearest 45 degree angle",
ylab="orientation"))

The graph above is the boxplot of the original orientation versus the rounded value, confirming that the values have mapped to the correct angles.

MAC addresses

In [29]:
# find number of macids and channels
c(length(unique(offline$mac)), length(unique(offline$channel)))

# create a table for all the count of macid readings
# default is count
table(offline$mac)
  1. 12
  2. 8
00:04:0e:5c:23:fc 00:0f:a3:39:dd:cd 00:0f:a3:39:e0:4b 00:0f:a3:39:e1:c0 
              418            145619             43508            145862 
00:0f:a3:39:e2:10 00:14:bf:3b:c7:c6 00:14:bf:b1:97:81 00:14:bf:b1:97:8a 
            19162            126529            120339            132962 
00:14:bf:b1:97:8d 00:14:bf:b1:97:90 00:30:bd:f8:7f:c5 00:e0:63:82:8b:a9 
           121325            122315               301               103 
In [30]:
# get names of the macs (top 7)
subMacs = names(sort(table(offline$mac), decreasing = TRUE))[1:7]
offline = offline[ offline$mac %in% subMacs, ]


macChannel = with(offline, table(mac, channel))
apply(macChannel, 1, function(x) sum(x > 0))

# channel and mac are 1:1 so drop channel
offline = offline[ , "channel" != names(offline)]
00:0f:a3:39:dd:cd
1
00:0f:a3:39:e1:c0
1
00:14:bf:3b:c7:c6
1
00:14:bf:b1:97:81
1
00:14:bf:b1:97:8a
1
00:14:bf:b1:97:8d
1
00:14:bf:b1:97:90
1

Position of the Hand-held device

In [31]:
# Create a list containing a dataframe for each location
locDF = with(offline, 
             by(offline, list(posX, posY), function(x) x))
length(locDF)
476

After dropping nulls, there are 166 locations

In [32]:
# find the number of nulls
sum(sapply(locDF, is.null))

# drop the nulls
locDF = locDF[ !sapply(locDF, is.null) ]

# how much do we have left
length(locDF)
310
166

Obtain the number of observations at each location

In [33]:
# get obs at each location and then store the data
locCounts = sapply(locDF, nrow)

locCounts = sapply(locDF, 
                   function(df) 
                     c(df[1, c("posX", "posY")], count = nrow(df)))

class(locCounts)

dim(locCounts)
# take a look at the first 8 only (bottom left of following plot)
locCounts[ , 1:8]
'matrix'
  1. 3
  2. 166
posX      0     1     2     0     1     2     0     1
posY      0     0     0     1     1     1     2     2
count  5505  5505  5506  5524  5543  5558  5503  5564
In [34]:
#Create graph
locCounts = t(locCounts)
plot(locCounts, type = "n", xlab = "", ylab = "")
text(locCounts, labels = locCounts[,3], cex = .8, srt = 45)

The chart above shows the counts of signals detected at each position for the offline data. There are close to 5,500 recordings at each location.

Start of Modeling and Data Analysis


Now that the data has been cleaned, the analysis can move forward.

In [35]:
# re-read our data, combining all the analysis we did; note there are SEVEN (7) mac ids

readData = 
  function(filename = 'offline.final.trace.txt', 
           subMacs = c("00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd", "00:14:bf:b1:97:8a",
                       "00:14:bf:3b:c7:c6", "00:14:bf:b1:97:90", "00:14:bf:b1:97:8d",
                       "00:14:bf:b1:97:81"))
  {
    txt = readLines(filename)
    lines = txt[ substr(txt, 1, 1) != "#" ]
    tmp = lapply(lines, processLine)
    offline = as.data.frame(do.call("rbind", tmp), 
                            stringsAsFactors= FALSE) 
    
    names(offline) = c("time", "scanMac", 
                       "posX", "posY", "posZ", "orientation", 
                       "mac", "signal", "channel", "type")
    
     # keep only signals from access points
    offline = offline[ offline$type == "3", ]
    
    # drop scanMac, posZ, channel, and type - no info in them
    dropVars = c("scanMac", "posZ", "channel", "type")
    offline = offline[ , !( names(offline) %in% dropVars ) ]
    
    # drop more unwanted access points
    offline = offline[ offline$mac %in% subMacs, ]
    
    # convert numeric values
    numVars = c("time", "posX", "posY", "orientation", "signal")
    offline[ numVars ] = lapply(offline[ numVars ], as.numeric)

    # convert time to POSIX
    offline$rawTime = offline$time
    offline$time = offline$time/1000
    class(offline$time) = c("POSIXt", "POSIXct")
    
    # round orientations to nearest 45
    offline$angle = roundOrientation(offline$orientation)
      
    return(offline)
  }
In [36]:
# implement our function
offlineRedo = readData()

identical(offline, offlineRedo)
TRUE

Understanding Signal Strength

Signal strength is our response variable so it is important to visually inspect the signal for each MAC address.

In [37]:
# plot signal strength for each device - all 7
# poor mac id 00:0f:a3:39:dd:cd -- it gets dropped later, as it did in Nolan and Lang's analysis
library(lattice)
bwplot(signal ~ factor(angle) | mac, data = offline, 
       subset = posX == 2 & posY == 12, 
       layout = c(1,7))

The above boxplots show signal strength by angle for all 7 access points

In [38]:
# plot signal strength for the two MAC addresses that will be swapped in the models
# use %in% rather than == so each row is matched against both addresses
# (== recycles the comparison vector and triggers a warning)
library(lattice)
bwplot(signal ~ factor(angle) | mac, data = offline, 
       subset = posX == 2 & posY == 12 
                & mac %in% c("00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd"), 
       layout = c(2,1))

By comparing the two boxplots above of signal strength by angle for each of the access points compared in this study, we can see that MAC address 00:0f:a3:39:e1:c0 appears to have higher signal strength.

In [39]:
# examine signal strength overall
summary(offline$signal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    -98     -67     -59     -60     -53     -25 
In [40]:
# signal density plots for ALL seven mac ids at X=24, Y=4
densityplot( ~ signal | mac + factor(angle), data = offline,
             subset = posX == 24 & posY == 4,
             bw = 0.5, plot.points = FALSE)

This graph displays the distributions of signal by angle for each access point. Some display an approximately normal distribution and some are skewed.

In [41]:
# code to examine signal strength distribution for all 166 locations, angles, and 7 access points
#Setup all the data using the data summary
offline$posXY = paste(offline$posX, offline$posY, sep = "-")

byLocAngleAP = with(offline, 
                    by(offline, list(posXY, angle, mac), 
                       function(x) x))

signalSummary = 
  lapply(byLocAngleAP,            
         function(oneLoc) {
           ans = oneLoc[1, ]
           ans$medSignal = median(oneLoc$signal)
           ans$avgSignal = mean(oneLoc$signal)
           ans$num = length(oneLoc$signal)
           ans$sdSignal = sd(oneLoc$signal)
           ans$iqrSignal = IQR(oneLoc$signal)
           ans
           })

offlineSummary = do.call("rbind", signalSummary)
In [42]:
# look at the distribution of signal strength
breaks = seq(-90, -30, by = 5)
bwplot(sdSignal ~ cut(avgSignal, breaks = breaks),
       data = offlineSummary, 
       subset = mac != "00:0f:a3:39:dd:cd",
       xlab = "Mean Signal", ylab = "SD Signal")

The above box plot of standard deviation of signal strength by mean signal strength indicates that stronger signals have more variability while weaker signals have low variability.

In [43]:
oneAPAngle = subset(offlineSummary, 
                    mac == subMacs[5] & angle == 0)


library(fields)
smoothSS = Tps(oneAPAngle[, c("posX","posY")], 
               oneAPAngle$avgSignal)

vizSmooth = predictSurface(smoothSS)

plot.surface(vizSmooth, type = "C")

points(oneAPAngle$posX, oneAPAngle$posY, pch=19, cex = 0.5)

The graph above illustrates the location of the access point as the dark red region at the top of the image. We also confirm the effect of the orientation on signal strength.

In [44]:
surfaceSS = function(data, mac, angle = 45) {
  require(fields)
  oneAPAngle = data[ data$mac == mac & data$angle == angle, ]
  smoothSS = Tps(oneAPAngle[, c("posX","posY")], 
                 oneAPAngle$avgSignal)
  vizSmooth = predictSurface(smoothSS)
  plot.surface(vizSmooth, type = "C", 
               xlab = "", ylab = "", xaxt = "n", yaxt = "n")
  points(oneAPAngle$posX, oneAPAngle$posY, pch=19, cex = 0.5) 
}
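The helper above can then be applied to other access points and orientations. A possible usage sketch (assuming the offlineSummary data frame built in the cells above):

```r
# Possible use of surfaceSS: smoothed signal maps for one access point
# at four orientations (assumes offlineSummary exists; a sketch, not
# part of the original analysis)
oldPar = par(mfrow = c(2, 2), mar = rep(1, 4))
for (a in c(0, 90, 180, 270))
  surfaceSS(offlineSummary, mac = "00:0f:a3:39:e1:c0", angle = a)
par(oldPar)
```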

KNN with MAC 00:0f:a3:39:e1:c0

In [45]:
# creating a backup of original
# drop 00:0f:a3:39:dd:cd
offlineSummary_backup = offlineSummary
offlineSummary <- offlineSummary[ offlineSummary$mac != "00:0f:a3:39:dd:cd", ]
In [46]:
# Look at the access points
# signal strength vs distance

AP = matrix( c( 7.5, 6.3, 2.5, -.8, 12.8, -2.8,  
                1, 14, 33.5, 9.3,  33.5, 2.8),
            ncol = 2, byrow = TRUE,
            dimnames = list(subMacs[ -2 ], c("x", "y") ))

AP

diffs = offlineSummary[ , c("posX", "posY")] - 
          AP[ offlineSummary$mac, ]

offlineSummary$dist = sqrt(diffs[ , 1]^2 + diffs[ , 2]^2)

xyplot(signal ~ dist | factor(mac) + factor(angle), 
       data = offlineSummary, pch = 19, cex = 0.3,
       xlab ="distance")
                     x    y
00:0f:a3:39:e1:c0  7.5  6.3
00:14:bf:b1:97:8a  2.5 -0.8
00:14:bf:3b:c7:c6 12.8 -2.8
00:14:bf:b1:97:90  1.0 14.0
00:14:bf:b1:97:8d 33.5  9.3
00:14:bf:b1:97:81 33.5  2.8

The above plots show the signal strength against the distance to the access points.

Below, the "online" data is read in as the test data.

In [47]:
# tally signal strength

macs = unique(offlineSummary$mac)
online = readData("online.final.trace.txt", subMacs = macs)

online$posXY = paste(online$posX, online$posY, sep = "-")

length(unique(online$posXY))

tabonlineXYA = table(online$posXY, online$angle)
tabonlineXYA[1:6, ]


#organize the data so that we have 6 columns of signal strengths, i.e., one for  each of the access points.
keepVars = c("posXY", "posX","posY", "orientation", "angle")
byLoc = with(online, 
             by(online, list(posXY), 
                function(x) {
                  ans = x[1, keepVars]
                  avgSS = tapply(x$signal, x$mac, mean)
                  y = matrix(avgSS, nrow = 1, ncol = 6,
                        dimnames = list(ans$posXY, names(avgSS)))
                  cbind(ans, y)
                }))

onlineSummary = do.call("rbind", byLoc)  

# then confirm with dim() and names()
dim(onlineSummary)

names(onlineSummary)
60
            
               0  45  90 135 180 225 270 315
  0-0.05       0   0   0 593   0   0   0   0
  0.15-9.42    0   0 606   0   0   0   0   0
  0.31-11.09   0   0   0   0   0 573   0   0
  0.47-8.2   590   0   0   0   0   0   0   0
  0.78-10.94 586   0   0   0   0   0   0   0
  0.93-11.69   0   0   0   0 583   0   0   0
  1. 60
  2. 11
  1. 'posXY'
  2. 'posX'
  3. 'posY'
  4. 'orientation'
  5. 'angle'
  6. '00:0f:a3:39:e1:c0'
  7. '00:14:bf:3b:c7:c6'
  8. '00:14:bf:b1:97:81'
  9. '00:14:bf:b1:97:8a'
  10. '00:14:bf:b1:97:8d'
  11. '00:14:bf:b1:97:90'

Signal strengths were recorded at one orientation for each location.

In [48]:
# create data frame and functions to aggregate/select data with similar angles


m = 3; angleNewObs = 230
refs = seq(0, by = 45, length  = 8)
nearestAngle = roundOrientation(angleNewObs)
  
if (m %% 2 == 1) {
  angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
} else {
  m = m + 1
  angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
  if (sign(angleNewObs - nearestAngle) > -1) 
    angles = angles[ -1 ]
  else 
    angles = angles[ -m ]
}
angles = angles + nearestAngle
angles[angles < 0] = angles[ angles < 0 ] + 360
angles[angles > 360] = angles[ angles > 360 ] - 360

offlineSubset = 
  offlineSummary[ offlineSummary$angle %in% angles, ]

reshapeSS = function(data, varSignal = "signal", 
                     keepVars = c("posXY", "posX","posY")) {
  byLocation =
    with(data, by(data, list(posXY), 
                  function(x) {
                    ans = x[1, keepVars]
                    avgSS = tapply(x[ , varSignal ], x$mac, mean)
                    y = matrix(avgSS, nrow = 1, ncol = 6,
                               dimnames = list(ans$posXY,
                                               names(avgSS)))
                    cbind(ans, y)
                  }))

  newDataSS = do.call("rbind", byLocation)
  return(newDataSS)
}
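As a hand-worked check of the angle selection at the top of the cell above (angleNewObs = 230, m = 3): roundOrientation(230) returns the nearest 45-degree reference, 225, and since m is odd the offsets are -45, 0, and 45, giving three retained angles.

```r
# Hand-worked check of the angle selection for angleNewObs = 230, m = 3:
# nearest 45-degree reference is 225; odd m gives offsets -45, 0, 45
c(-45, 0, 45) + 225    # 180 225 270 -- no wraparound adjustment needed
```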
In [49]:
#Summarize and reshape offlineSubset
trainSS = reshapeSS(offlineSubset, varSignal = "avgSignal")

selectTrain = function(angleNewObs, signals = NULL, m = 1){
  # m is the number of angles to keep between 1 and 5
  refs = seq(0, by = 45, length  = 8)
  nearestAngle = roundOrientation(angleNewObs)
  
  if (m %% 2 == 1) 
    angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
  else {
    m = m + 1
    angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
    if (sign(angleNewObs - nearestAngle) > -1) 
      angles = angles[ -1 ]
    else 
      angles = angles[ -m ]
  }
  angles = angles + nearestAngle
  angles[angles < 0] = angles[ angles < 0 ] + 360
  angles[angles > 360] = angles[ angles > 360 ] - 360
  angles = sort(angles) 
  
  offlineSubset = signals[ signals$angle %in% angles, ]
  reshapeSS(offlineSubset, varSignal = "avgSignal")
}

train130 = selectTrain(130, offlineSummary, m = 3)

head(train130)

length(train130[[1]])
     posXY posX posY 00:0f:a3:39:e1:c0 00:14:bf:3b:c7:c6 00:14:bf:b1:97:81 00:14:bf:b1:97:8a 00:14:bf:b1:97:8d 00:14:bf:b1:97:90
0-0    0-0    0    0               -52               -66               -63               -36               -64               -55
0-1    0-1    0    1               -53               -65               -64               -39               -65               -59
0-10  0-10    0   10               -56               -66               -69               -45               -67               -50
0-11  0-11    0   11               -55               -67               -70               -48               -67               -55
0-12  0-12    0   12               -56               -70               -72               -45               -67               -50
0-13  0-13    0   13               -55               -71               -73               -43               -69               -54
166

Now the training data can be used in conjunction with the online data for analysis. The following code runs the first kNN model with MAC 00:0f:a3:39:e1:c0 and compares the nearest neighbor and the 3 nearest neighbors.

In [50]:
# here is the kNN function.
findNN = function(newSignal, trainSubset) {
  diffs = apply(trainSubset[ , 4:9], 1, 
                function(x) x - newSignal)
  dists = apply(diffs, 2, function(x) sqrt(sum(x^2)) )
  closest = order(dists)
  return(trainSubset[closest, 1:3 ])
}
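A quick sanity check of the signal-space Euclidean distance used in findNN, with two made-up six-AP signal vectors:

```r
# Made-up vectors illustrating the Euclidean distance findNN computes
newSignal = c(-50, -60, -70, -55, -65, -58)
trainRow  = c(-53, -62, -68, -57, -66, -60)
sqrt(sum((trainRow - newSignal)^2))   # sqrt(26), about 5.1
```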
In [51]:
# predict X-Y based on the nearest k neighbors (default 3)
predXY = function(newSignals, newAngles, trainData, 
                  numAngles = 1, k = 3){
  
  closeXY = list(length = nrow(newSignals))
  
  for (i in 1:nrow(newSignals)) {
    trainSS = selectTrain(newAngles[i], trainData, m = numAngles)
    closeXY[[i]] = 
      findNN(newSignal = as.numeric(newSignals[i, ]), trainSS)
  }

  estXY = lapply(closeXY, 
                 function(x) sapply(x[ , 2:3], 
                                    function(x) mean(x[1:k])))
  estXY = do.call("rbind", estXY)
  return(estXY)
}
                                    
# nearest 3 neighbors                                    
                                    
estXYk3 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 3)

# nearest 1 neighbor
estXYk1 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 1)
In [52]:
# predict and map errors
floorErrorMap = function(estXY, actualXY, trainPoints = NULL, AP = NULL){
  
    plot(0, 0, xlim = c(0, 35), ylim = c(-3, 15), type = "n",
         xlab = "", ylab = "", axes = FALSE)
    box()
    if ( !is.null(AP) ) points(AP, pch = 15)
    if ( !is.null(trainPoints) )
      points(trainPoints, pch = 19, col="grey", cex = 0.6)
    
    points(x = actualXY[, 1], y = actualXY[, 2], 
           pch = 19, cex = 0.8 )
    points(x = estXY[, 1], y = estXY[, 2], 
           pch = 8, cex = 0.8 )
    segments(x0 = estXY[, 1], y0 = estXY[, 2],
             x1 = actualXY[, 1], y1 = actualXY[ , 2],
             lwd = 2, col = "red")
}

trainPoints = offlineSummary[ offlineSummary$angle == 0 & 
                              offlineSummary$mac == "00:0f:a3:39:e1:c0" ,
                        c("posX", "posY")]

# 3 NN

floorErrorMap(estXYk3, onlineSummary[ , c("posX","posY")], 
              trainPoints = trainPoints, AP = AP)


# 1 NN
floorErrorMap(estXYk1, onlineSummary[ , c("posX","posY")], 
              trainPoints = trainPoints, AP = AP)

The graphs above represent the floor plan of the building with both the predicted and the actual locations. The red lines connect the test locations (black dots) to their predicted locations (asterisks). The top plot shows the predictions for $k$ = 3 and the bottom plot for $k$ = 1 nearest neighbor for the model with MAC address 00:0f:a3:39:e1:c0. $K$ = 3 looks like the better approximation.

Here, a sum of squared errors (SSE) function, $\sum_i \left[(\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2\right]$, is created for comparing models. The error for $k$=1 is 659.4 and for $k$=3 it is 306.7, meaning that averaging the 3 nearest neighbors gives a more accurate location.
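A two-point toy example of the SSE computation, with made-up coordinates:

```r
# Toy SSE computation with made-up predicted and actual coordinates
est    = rbind(c(1.0, 2.0), c(3.5, 4.0))   # predicted (x, y)
actual = rbind(c(1.5, 2.5), c(3.0, 4.5))   # true (x, y)
sum(rowSums((est - actual)^2))             # 4 * 0.25 = 1
```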

Next, cross-validated selection of $k$ is performed. The training data are divided into $v$ non-overlapping subsets of equal size. For each subset, a model is built with the data that are not in that subset, and its predictive ability is assessed on the subset that was left out. This is repeated for each of the $v$ folds, and the prediction errors are aggregated across the folds.
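A minimal sketch of the fold construction used below, with stand-in location IDs (the actual code samples the 166 posXY values):

```r
# Minimal sketch of the v-fold split: 166 locations into 11 folds of 15
# (one location is left over, hence the warning seen below)
set.seed(15)
locs = paste0("loc", 1:166)   # stand-ins for the posXY location labels
v = 11
folds = matrix(sample(locs), ncol = v, nrow = floor(length(locs) / v))
# test locations for fold j:  folds[ , j]
# training locations:         folds[ , -j]
```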

set.seed(15) was used for all models.

In [53]:
set.seed(15)
options(error = recover, warn = 1)
calcError = 
function(estXY, actualXY) 
   sum( rowSums( (estXY - actualXY)^2) )

actualXY = onlineSummary[ , c("posX", "posY")]
sapply(list(estXYk1, estXYk3), calcError, actualXY)
    
v = 11
permuteLocs = sample(unique(offlineSummary$posXY))
permuteLocs = matrix(permuteLocs, ncol = v, 
                     nrow = floor(length(permuteLocs)/v))

onlineFold = subset(offlineSummary, posXY %in% permuteLocs[ , 1])

reshapeSS = function(data, varSignal = "signal", 
                     keepVars = c("posXY", "posX","posY"),
                     sampleAngle = FALSE, 
                     refs = seq(0, 315, by = 45)) {
  byLocation =
    with(data, by(data, list(posXY), 
                  function(x) {
                    if (sampleAngle) {
                      x = x[x$angle == sample(refs, size = 1), ]}
                    ans = x[1, keepVars]
                    avgSS = tapply(x[ , varSignal ], x$mac, mean)
                    y = matrix(avgSS, nrow = 1, ncol = 6,
                               dimnames = list(ans$posXY,
                                               names(avgSS)))
                    cbind(ans, y)
                  }))

  newDataSS = do.call("rbind", byLocation)
  return(newDataSS)
}
  1. 659.4003
  2. 306.702522222222
Warning message in matrix(permuteLocs, ncol = v, nrow = floor(length(permuteLocs)/v)):
"data length [166] is not a sub-multiple or multiple of the number of rows [15]"
In [54]:
# up to 20 neighbors, 11 folds
# this can take a while to run (5-10 minutes)
# this cell and the corresponding one for the next model follow the same
# procedure, but the sampled angles differ slightly
offline = offline[ offline$mac != "00:0f:a3:39:dd:cd", ]

keepVars = c("posXY", "posX","posY", "orientation", "angle")
set.seed(15)
onlineCVSummary = reshapeSS(offline, keepVars = keepVars, 
                            sampleAngle = TRUE)

onlineFold = subset(onlineCVSummary, 
                    posXY %in% permuteLocs[ , 1])

offlineFold = subset(offlineSummary,
                     posXY %in% permuteLocs[ , -1])

estFold = predXY(newSignals = onlineFold[ , 6:11], 
                 newAngles = onlineFold[ , 4], 
                 offlineFold, numAngles = 3, k = 3)

actualFold = onlineFold[ , c("posX", "posY")]
calcError(estFold, actualFold)

K = 20
err = rep(0, K)

for (j in 1:v) {
  onlineFold = subset(onlineCVSummary, 
                      posXY %in% permuteLocs[ , j])
  offlineFold = subset(offlineSummary,
                       posXY %in% permuteLocs[ , -j])
  actualFold = onlineFold[ , c("posX", "posY")]
  
  for (k in 1:K) {
    estFold = predXY(newSignals = onlineFold[ , 6:11],
                     newAngles = onlineFold[ , 4], 
                     offlineFold, numAngles = 3, k = k)
    err[k] = err[k] + calcError(estFold, actualFold)
  }
}
49.3333333333333
In [55]:
plot(y = err, x = (1:K),  type = "l", lwd= 2, 
     ylim = c(1000, 2100),
     xlab = "Number of Neighbors",
     ylab = "Sum of Square Errors")
title("SSE for k values 1 to 20 with MAC 00:0f:a3:39:e1:c0")
rmseMin = min(err)
kMin = which(err == rmseMin)[1]
rmseMin
kMin
segments(x0 = 0, x1 = kMin, y0 = rmseMin, col = gray(0.4), 
         lty = 2, lwd = 2)
segments(x0 = kMin, x1 = kMin, y0 = 975,  y1 = rmseMin, 
         col = gray(0.4), lty = 2, lwd = 2)

text(x = kMin - 2, y = rmseMin + 40, 
     label = as.character(round(rmseMin)), col = grey(0.4))
1197.58333333333
6
In [56]:
#Estimated Error

estXYk6 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 6)

rmseMin
calcError(estXYk6, actualXY)
1197.58333333333
272.409188888889
In [57]:
# predXY2: identical to predXY except numAngles defaults to 3
predXY2 = function(newSignals, newAngles, trainData, 
                  numAngles = 3, k = 3){
  
  closeXY = list(length = nrow(newSignals))
  
  for (i in 1:nrow(newSignals)) {
    trainSS = selectTrain(newAngles[i], trainData, m = numAngles)
    closeXY[[i]] = findNN(newSignal = as.numeric(newSignals[i, ]),
                           trainSS)
  }

  estXY = lapply(closeXY, function(x)
                            sapply(x[ , 2:3], 
                                    function(x) mean(x[1:k])))
  estXY = do.call("rbind", estXY)
  return(estXY)
}

KNN with MAC 00:0f:a3:39:dd:cd

The analysis is performed in the same manner as for the first model, except MAC 00:0f:a3:39:e1:c0 is excluded and 00:0f:a3:39:dd:cd is retained.

In [58]:
#Reset data and omit the Mac 00:0f:a3:39:e1:c0
subMacs = c("00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd", "00:14:bf:b1:97:8a",
                       "00:14:bf:3b:c7:c6", "00:14:bf:b1:97:90", "00:14:bf:b1:97:8d", "00:14:bf:b1:97:81")
unusedMac = "00:0f:a3:39:e1:c0"
Mac2 = subMacs[subMacs != unusedMac] 
offlineRedo = readData(subMacs = Mac2)
In [59]:
summary(offlineRedo$signal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    -98     -69     -61     -61     -54     -25 
In [60]:
offlineRedo$posXY = paste(offlineRedo$posX, offlineRedo$posY, sep = "-")

byLocAngleAP = with(offlineRedo, 
                    by(offlineRedo, list(posXY, angle, mac), 
                       function(x) x))

signalSummary = 
  lapply(byLocAngleAP,            
         function(oneLoc) {
           ans = oneLoc[1, ]
           ans$medSignal = median(oneLoc$signal)
           ans$avgSignal = mean(oneLoc$signal)
           ans$num = length(oneLoc$signal)
           ans$sdSignal = sd(oneLoc$signal)
           ans$iqrSignal = IQR(oneLoc$signal)
           ans
           })

offlineSummary = do.call("rbind", signalSummary)
In [61]:
breaks = seq(-90, -30, by = 5)
bwplot(sdSignal ~ cut(avgSignal, breaks = breaks),
       data = offlineSummary, 
       subset = mac != unusedMac,
       xlab = "Mean Signal", ylab = "SD Signal")

Above is the box plot of the standard deviation of signal strength by mean signal strength for this model.

In [62]:
AP = matrix( c( 7.5, 6.3, 2.5, -.8, 12.8, -2.8,  
                1, 14, 33.5, 9.3,  33.5, 2.8),
            ncol = 2, byrow = TRUE,
            dimnames = list(subMacs[ -1 ], c("x", "y") ))

AP

diffs = offlineSummary[ , c("posX", "posY")] - 
          AP[ offlineSummary$mac, ]

offlineSummary$dist = sqrt(diffs[ , 1]^2 + diffs[ , 2]^2)

xyplot(signal ~ dist | factor(mac) + factor(angle), 
       data = offlineSummary, pch = 19, cex = 0.3,
       xlab ="distance")
                     x    y
00:0f:a3:39:dd:cd  7.5  6.3
00:14:bf:b1:97:8a  2.5 -0.8
00:14:bf:3b:c7:c6 12.8 -2.8
00:14:bf:b1:97:90  1.0 14.0
00:14:bf:b1:97:8d 33.5  9.3
00:14:bf:b1:97:81 33.5  2.8

The above plots display signal strength against the distance to each access point.

In [63]:
macs = unique(offlineSummary$mac)
online = readData("online.final.trace.txt", subMacs = macs)

online$posXY = paste(online$posX, online$posY, sep = "-")

length(unique(online$posXY))

tabonlineXYA = table(online$posXY, online$angle)
tabonlineXYA[1:6, ]

keepVars = c("posXY", "posX","posY", "orientation", "angle")
byLoc = with(online, 
             by(online, list(posXY), 
                function(x) {
                  ans = x[1, keepVars]
                  avgSS = tapply(x$signal, x$mac, mean)
                  y = matrix(avgSS, nrow = 1, ncol = 6,
                        dimnames = list(ans$posXY, names(avgSS)))
                  cbind(ans, y)
                }))

onlineSummary = do.call("rbind", byLoc)
60
            
               0  45  90 135 180 225 270 315
  0-0.05       0   0   0 594   0   0   0   0
  0.15-9.42    0   0 608   0   0   0   0   0
  0.31-11.09   0   0   0   0   0 574   0   0
  0.47-8.2   591   0   0   0   0   0   0   0
  0.78-10.94 585   0   0   0   0   0   0   0
  0.93-11.69   0   0   0   0 581   0   0   0
In [64]:
dim(onlineSummary)
  1. 60
  2. 11
In [65]:
names(onlineSummary)
m = 3; angleNewObs = 230
refs = seq(0, by = 45, length  = 8)
nearestAngle = roundOrientation(angleNewObs)
  
if (m %% 2 == 1) {
  angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
} else {
  m = m + 1
  angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
  if (sign(angleNewObs - nearestAngle) > -1) 
    angles = angles[ -1 ]
  else 
    angles = angles[ -m ]
}
angles = angles + nearestAngle
angles[angles < 0] = angles[ angles < 0 ] + 360
angles[angles > 360] = angles[ angles > 360 ] - 360
offlineSubset = 
  offlineSummary[ offlineSummary$angle %in% angles, ]

reshapeSS = function(data, varSignal = "signal", 
                     keepVars = c("posXY", "posX","posY")) {
  byLocation =
    with(data, by(data, list(posXY), 
                  function(x) {
                    ans = x[1, keepVars]
                    avgSS = tapply(x[ , varSignal ], x$mac, mean)
                    y = matrix(avgSS, nrow = 1, ncol = 6,
                               dimnames = list(ans$posXY,
                                               names(avgSS)))
                    cbind(ans, y)
                  }))

  newDataSS = do.call("rbind", byLocation)
  return(newDataSS)
}
  1. 'posXY'
  2. 'posX'
  3. 'posY'
  4. 'orientation'
  5. 'angle'
  6. '00:0f:a3:39:dd:cd'
  7. '00:14:bf:3b:c7:c6'
  8. '00:14:bf:b1:97:81'
  9. '00:14:bf:b1:97:8a'
  10. '00:14:bf:b1:97:8d'
  11. '00:14:bf:b1:97:90'
In [66]:
trainSS = reshapeSS(offlineSubset, varSignal = "avgSignal")

selectTrain = function(angleNewObs, signals = NULL, m = 1) {
  # m is the number of angles to keep between 1 and 5
  refs = seq(0, by = 45, length  = 8)
  nearestAngle = roundOrientation(angleNewObs)
  
  if (m %% 2 == 1) 
    angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
  else {
    m = m + 1
    angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
    if (sign(angleNewObs - nearestAngle) > -1) 
      angles = angles[ -1 ]
    else 
      angles = angles[ -m ]
  }
  angles = angles + nearestAngle
  angles[angles < 0] = angles[ angles < 0 ] + 360
  angles[angles > 360] = angles[ angles > 360 ] - 360
  angles = sort(angles) 
  
  offlineSubset = signals[ signals$angle %in% angles, ]
  reshapeSS(offlineSubset, varSignal = "avgSignal")
}

train130 = selectTrain(130, offlineSummary, m = 3)

head(train130)

length(train130[[1]])
     posXY posX posY 00:0f:a3:39:dd:cd 00:14:bf:3b:c7:c6 00:14:bf:b1:97:81 00:14:bf:b1:97:8a 00:14:bf:b1:97:8d 00:14:bf:b1:97:90
0-0    0-0    0    0               -72               -66               -63               -36               -64               -55
0-1    0-1    0    1               -70               -65               -64               -39               -65               -59
0-10  0-10    0   10               -70               -66               -69               -45               -67               -50
0-11  0-11    0   11               -71               -67               -70               -48               -67               -55
0-12  0-12    0   12               -69               -70               -72               -45               -67               -50
0-13  0-13    0   13               -73               -71               -73               -43               -69               -54
166
In [67]:
findNN = function(newSignal, trainSubset) {
  diffs = apply(trainSubset[ , 4:9], 1, 
                function(x) x - newSignal)
  dists = apply(diffs, 2, function(x) sqrt(sum(x^2)) )
  closest = order(dists)
  return(trainSubset[closest, 1:3 ])
}
In [68]:
predXY = function(newSignals, newAngles, trainData, numAngles = 1, k = 3) {
  
  closeXY = list(length = nrow(newSignals))
  
  for (i in 1:nrow(newSignals)) {
    trainSS = selectTrain(newAngles[i], trainData, m = numAngles)
    closeXY[[i]] = 
      findNN(newSignal = as.numeric(newSignals[i, ]), trainSS)
  }

  estXY = lapply(closeXY, 
                 function(x) sapply(x[ , 2:3], 
                                    function(x) mean(x[1:k])))
  estXY = do.call("rbind", estXY)
  return(estXY)
}
estXYk3 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 3)
                                    

estXYk1 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 1)
In [69]:
floorErrorMap = function(estXY, actualXY, trainPoints = NULL, AP = NULL) {
  
    plot(0, 0, xlim = c(0, 35), ylim = c(-3, 15), type = "n",
         xlab = "", ylab = "", axes = FALSE)
    cap = "Figure x: Prediction vs. Real Location"
    box()
    if ( !is.null(AP) ) points(AP, pch = 15)
    if ( !is.null(trainPoints) )
      points(trainPoints, pch = 19, col="grey", cex = 0.6)
    
    points(x = actualXY[, 1], y = actualXY[, 2], 
           pch = 19, cex = 0.8 )
    points(x = estXY[, 1], y = estXY[, 2], 
           pch = 8, cex = 0.8 )
    segments(x0 = estXY[, 1], y0 = estXY[, 2],
             x1 = actualXY[, 1], y1 = actualXY[ , 2],
             lwd = 2, col = "red")
}

trainPoints = offlineSummary[ offlineSummary$angle == 0 & 
                              offlineSummary$mac == "00:0f:a3:39:dd:cd" ,
                        c("posX", "posY")]                                    

# 3

floorErrorMap(estXYk3, onlineSummary[ , c("posX","posY")], 
              trainPoints = trainPoints, AP = AP)


# 1

floorErrorMap(estXYk1, onlineSummary[ , c("posX","posY")], 
              trainPoints = trainPoints, AP = AP)

The graphs above represent the floor plan of the building with both the predicted and the actual locations. The red lines connect the test locations (black dots) to their predicted locations (asterisks). The top plot shows the predictions for $k$ = 3 and the bottom plot for $k$ = 1 nearest neighbor for the model with MAC address 00:0f:a3:39:dd:cd. $K$ = 3 looks like the better approximation.

Here, the sum of squared errors is computed again to compare models. The error for $k$=1 is 411.64 and for $k$=3 it is 270.46, meaning that averaging the 3 nearest neighbors gives a more accurate location. These error estimates are lower than those of the previous model.

Next, cross-validated selection of $k$ is performed. The training data are divided into $v$ non-overlapping subsets of equal size. For each subset, a model is built with the data that are not in that subset, and its predictive ability is assessed on the subset that was left out. This is repeated for each of the $v$ folds, and the prediction errors are aggregated across the folds.

set.seed(15) was used for all models.

In [70]:
set.seed(15)
calcError = 
function(estXY, actualXY) 
   sum( rowSums( (estXY - actualXY)^2) )

actualXY = onlineSummary[ , c("posX", "posY")]
sapply(list(estXYk1, estXYk3), calcError, actualXY)
    
set.seed(15)
v = 11
permuteLocs = sample(unique(offlineSummary$posXY))
permuteLocs = matrix(permuteLocs, ncol = v, 
                     nrow = floor(length(permuteLocs)/v))

onlineFold = subset(offlineSummary, posXY %in% permuteLocs[ , 1])

reshapeSS = function(data, varSignal = "signal", 
                     keepVars = c("posXY", "posX","posY"),
                     sampleAngle = FALSE, 
                     refs = seq(0, 315, by = 45)) {
  byLocation =
    with(data, by(data, list(posXY), 
                  function(x) {
                    if (sampleAngle) {
                      x = x[x$angle == sample(refs, size = 1), ]}
                    ans = x[1, keepVars]
                    avgSS = tapply(x[ , varSignal ], x$mac, mean)
                    y = matrix(avgSS, nrow = 1, ncol = 6,
                               dimnames = list(ans$posXY,
                                               names(avgSS)))
                    cbind(ans, y)
                  }))

  newDataSS = do.call("rbind", byLocation)
  return(newDataSS)
}
  1. 411.6403
  2. 270.458077777778
Warning message in matrix(permuteLocs, ncol = v, nrow = floor(length(permuteLocs)/v)):
"data length [166] is not a sub-multiple or multiple of the number of rows [15]"
In [71]:
offlineRedo = offlineRedo[ offlineRedo$mac != unusedMac, ]

keepVars = c("posXY", "posX","posY", "orientation", "angle")

onlineCVSummary = reshapeSS(offlineRedo, keepVars = keepVars, 
                            sampleAngle = TRUE)

onlineFold = subset(onlineCVSummary, 
                    posXY %in% permuteLocs[ , 1])

offlineFold = subset(offlineSummary,
                     posXY %in% permuteLocs[ , -1])

estFold = predXY(newSignals = onlineFold[ , 6:11], 
                 newAngles = onlineFold[ , 4], 
                 offlineFold, numAngles = 3, k = 3)

actualFold = onlineFold[ , c("posX", "posY")]
calcError(estFold, actualFold)

K = 20
err = rep(0, K)

for (j in 1:v) {
  onlineFold = subset(onlineCVSummary, 
                      posXY %in% permuteLocs[ , j])
  offlineFold = subset(offlineSummary,
                       posXY %in% permuteLocs[ , -j])
  actualFold = onlineFold[ , c("posX", "posY")]
  
  for (k in 1:K) {
    estFold = predXY(newSignals = onlineFold[ , 6:11],
                     newAngles = onlineFold[ , 4], 
                     offlineFold, numAngles = 3, k = k)
    err[k] = err[k] + calcError(estFold, actualFold)
  }
}
89.8888888888889
In [72]:
plot(y = err, x = (1:K),  type = "l", lwd= 2,
     ylim = c(900, 2100),
     xlab = "Number of Neighbors",
     ylab = "Sum of Square Errors")
title("SSE for k values 1 to 20 with MAC 00:0f:a3:39:dd:cd")
rmseMin = min(err)
rmseMin
kMin = which(err == rmseMin)[1]
kMin
segments(x0 = 0, x1 = kMin, y0 = rmseMin, col = gray(0.4), 
         lty = 2, lwd = 2)
segments(x0 = kMin, x1 = kMin, y0 = 870,  y1 = rmseMin, 
         col = grey(0.4), lty = 2, lwd = 2)
text(x = kMin - 2, y = rmseMin + 40, 
     label = as.character(round(rmseMin)), col = grey(0.4))
1002.63265306122
7
In [73]:
#Estimated Error

estXYk7 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 7)
calcError(estXYk7, actualXY)
258.074993877551

The estimated error is lower with only MAC 00:0f:a3:39:dd:cd (258.07 at $k$ = 7) than with only 00:0f:a3:39:e1:c0 (272.41 at $k$ = 6).

KNN with both MACs 00:0f:a3:39:e1:c0 and 00:0f:a3:39:dd:cd

The analysis is performed in the same manner as for the first model, except both MACs 00:0f:a3:39:e1:c0 and 00:0f:a3:39:dd:cd are included.

In [74]:
#redo data and include both Macs 00:0f:a3:39:e1:c0 and 00:0f:a3:39:dd:cd
subMacs = c("00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd", "00:14:bf:b1:97:8a",
                       "00:14:bf:3b:c7:c6", "00:14:bf:b1:97:90", "00:14:bf:b1:97:8d",
                       "00:14:bf:b1:97:81")

offlineRedo = readData()

oldPar = par(mar = c(3.1, 3, 1, 1))

offlineRedo$posXY = paste(offlineRedo$posX, offlineRedo$posY, sep = "-")

byLocAngleAP = with(offlineRedo, 
                    by(offlineRedo, list(posXY, angle, mac), 
                       function(x) x))

signalSummary = 
  lapply(byLocAngleAP,            
         function(oneLoc) {
           ans = oneLoc[1, ]
           ans$medSignal = median(oneLoc$signal)
           ans$avgSignal = mean(oneLoc$signal)
           ans$num = length(oneLoc$signal)
           ans$sdSignal = sd(oneLoc$signal)
           ans$iqrSignal = IQR(oneLoc$signal)
           ans
           })

offlineSummary = do.call("rbind", signalSummary)
In [75]:
macs = unique(offlineSummary$mac)    
macs                       
    
online = readData("online.final.trace.txt", subMacs = macs)

online$posXY = paste(online$posX, online$posY, sep = "-")

length(unique(online$posXY))

tabonlineXYA = table(online$posXY, online$angle)
tabonlineXYA[1:7, ]

keepVars = c("posXY", "posX","posY", "orientation", "angle")
byLoc = with(online, 
             by(online, list(posXY), 
                function(x) {
                  ans = x[1, keepVars]
                  avgSS = tapply(x$signal, x$mac, mean)
                  y = matrix(avgSS, nrow = 1, ncol = 7,
                        dimnames = list(ans$posXY, names(avgSS)))
                  cbind(ans, y)
                }))

onlineSummary = do.call("rbind", byLoc)
  1. '00:0f:a3:39:dd:cd'
  2. '00:0f:a3:39:e1:c0'
  3. '00:14:bf:3b:c7:c6'
  4. '00:14:bf:b1:97:81'
  5. '00:14:bf:b1:97:8a'
  6. '00:14:bf:b1:97:8d'
  7. '00:14:bf:b1:97:90'
60
            
               0  45  90 135 180 225 270 315
  0-0.05       0   0   0 704   0   0   0   0
  0.15-9.42    0   0 717   0   0   0   0   0
  0.31-11.09   0   0   0   0   0 684   0   0
  0.47-8.2   701   0   0   0   0   0   0   0
  0.78-10.94 695   0   0   0   0   0   0   0
  0.93-11.69   0   0   0   0 691   0   0   0
  1.08-12.19   0   0   0   0   0 742   0   0
In [76]:
names(onlineSummary)
m = 3; angleNewObs = 230
refs = seq(0, by = 45, length  = 8)
nearestAngle = roundOrientation(angleNewObs)
  
if (m %% 2 == 1) {
  angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
} else {
  m = m + 1
  angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
  if (sign(angleNewObs - nearestAngle) > -1) 
    angles = angles[ -1 ]
  else 
    angles = angles[ -m ]
}
angles = angles + nearestAngle
angles[angles < 0] = angles[ angles < 0 ] + 360
angles[angles > 360] = angles[ angles > 360 ] - 360
offlineSubset = 
  offlineSummary[ offlineSummary$angle %in% angles, ]

reshapeSS = function(data, varSignal = "signal", 
                     keepVars = c("posXY", "posX","posY")) {
  byLocation =
    with(data, by(data, list(posXY), 
                  function(x) {
                    ans = x[1, keepVars]
                    avgSS = tapply(x[ , varSignal ], x$mac, mean)
                    y = matrix(avgSS, nrow = 1, ncol = 7,
                               dimnames = list(ans$posXY,
                                               names(avgSS)))
                    cbind(ans, y)
                  }))

  newDataSS = do.call("rbind", byLocation)
  return(newDataSS)
}

trainSS = reshapeSS(offlineSubset, varSignal = "avgSignal")

selectTrain = function(angleNewObs, signals = NULL, m = 1) {
  # m is the number of angles to keep between 1 and 5
  refs = seq(0, by = 45, length  = 8)
  nearestAngle = roundOrientation(angleNewObs)
  
  if (m %% 2 == 1) 
    angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
  else {
    m = m + 1
    angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
    if (sign(angleNewObs - nearestAngle) > -1) 
      angles = angles[ -1 ]
    else 
      angles = angles[ -m ]
  }
  angles = angles + nearestAngle
  angles[angles < 0] = angles[ angles < 0 ] + 360
  angles[angles > 360] = angles[ angles > 360 ] - 360
  angles = sort(angles) 
  
  offlineSubset = signals[ signals$angle %in% angles, ]
  reshapeSS(offlineSubset, varSignal = "avgSignal")
}

train130 = selectTrain(130, offlineSummary, m = 3)
  1. 'posXY'
  2. 'posX'
  3. 'posY'
  4. 'orientation'
  5. 'angle'
  6. '00:0f:a3:39:dd:cd'
  7. '00:0f:a3:39:e1:c0'
  8. '00:14:bf:3b:c7:c6'
  9. '00:14:bf:b1:97:81'
  10. '00:14:bf:b1:97:8a'
  11. '00:14:bf:b1:97:8d'
  12. '00:14:bf:b1:97:90'
In [77]:
findNN = function(newSignal, trainSubset) {
  # note: columns 4:9 select six signal columns; with all seven access
  # points present in trainSubset, the range would need to be 4:10
  diffs = apply(trainSubset[ , 4:9], 1, 
                function(x) x - newSignal)
  dists = apply(diffs, 2, function(x) sqrt(sum(x^2)) )
  closest = order(dists)
  return(trainSubset[closest, 1:3 ])
}
In [78]:
predXY = function(newSignals, newAngles, trainData, numAngles = 1, k = 3) {
  
  closeXY = list(length = nrow(newSignals))
  
  for (i in 1:nrow(newSignals)) {
    trainSS = selectTrain(newAngles[i], trainData, m = numAngles)
    closeXY[[i]] = 
      findNN(newSignal = as.numeric(newSignals[i, ]), trainSS)
  }

  estXY = lapply(closeXY, 
                 function(x) sapply(x[ , 2:3], 
                                    function(x) mean(x[1:k])))
  estXY = do.call("rbind", estXY)
  return(estXY)
}
                                    
estXYk3 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 3)
                                    

estXYk1 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 1)   
                                    

calcError = 
function(estXY, actualXY) 
   sum( rowSums( (estXY - actualXY)^2) )

actualXY = onlineSummary[ , c("posX", "posY")]
sapply(list(estXYk1, estXYk3), calcError, actualXY)
  1. 510.4003
  2. 244.206966666667

Here, a sum of squared errors function is created for comparing models. The error for $k$=1 is 510.40, and for $k$=3 the error is 244.21, meaning that averaging over the 3 nearest neighbors gives a more accurate location estimate than using the single nearest neighbor.
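Because calcError sums squared Euclidean distances, a total SSE can be translated into a more interpretable root-mean-square distance error per test location. The helper below, avgDistError, is a hypothetical addition (not part of the original analysis) that performs this conversion for a given number of test locations n:

```r
# Hypothetical helper (not in the original analysis): convert a total SSE,
# i.e. a sum of squared Euclidean distances, into a root-mean-square
# distance error per test location.
avgDistError = function(sse, n) sqrt(sse / n)

avgDistError(100, 4)  # 4 locations with total SSE 100 -> RMS error of 5
```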

Next, cross-validated selection of $k$ is performed. The training data are divided into $v$ non-overlapping subsets of equal size. For each subset, a model is built with the data that are not in that subset, and the predictive ability of the model is assessed using the subset that was left out. This process is repeated for each of the $v$ folds, and the prediction errors are aggregated across the folds.
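The fold construction used below can be sketched on a toy example (hypothetical location IDs, not the Mannheim data): the location IDs are shuffled once and laid out into a matrix, with each column serving as one fold's held-out set.

```r
# Toy sketch of v-fold assignment by location (hypothetical IDs, v = 3):
set.seed(1)
locs = paste0("loc", 1:12)              # 12 hypothetical location IDs
folds = matrix(sample(locs),            # shuffle once...
               nrow = 4, ncol = 3)      # ...then split: one fold per column
folds[, 1]                              # the locations held out in fold 1
```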

In [79]:
set.seed(15)
v = 11
permuteLocs = sample(unique(offlineSummary$posXY))
permuteLocs = matrix(permuteLocs, ncol = v, 
                     nrow = floor(length(permuteLocs)/v))

onlineFold = subset(offlineSummary, posXY %in% permuteLocs[ , 1])

reshapeSS = function(data, varSignal = "signal", 
                     keepVars = c("posXY", "posX","posY"),
                     sampleAngle = FALSE, 
                     refs = seq(0, 315, by = 45)) {
  byLocation =
    with(data, by(data, list(posXY), 
                  function(x) {
                    if (sampleAngle) {
                      x = x[x$angle == sample(refs, size = 1), ]}
                    ans = x[1, keepVars]
                    avgSS = tapply(x[ , varSignal ], x$mac, mean)
                    y = matrix(avgSS, nrow = 1, ncol = 7,
                               dimnames = list(ans$posXY,
                                               names(avgSS)))
                    cbind(ans, y)
                  }))

  newDataSS = do.call("rbind", byLocation)
  return(newDataSS)
}
Warning message in matrix(permuteLocs, ncol = v, nrow = floor(length(permuteLocs)/v)):
"data length [166] is not a sub-multiple or multiple of the number of rows [15]"
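The warning above is worth noting: the 166 shuffled location IDs do not fit evenly into a 15 x 11 matrix (165 cells), so matrix() truncates the vector and one randomly chosen location is silently left out of every fold. A minimal reproduction of the same behavior:

```r
# matrix() warns and truncates when the data length does not divide
# evenly into nrow * ncol -- here the 7th value is silently dropped.
x = 1:7
m = matrix(x, nrow = 2, ncol = 3)  # warns that 7 is not a sub-multiple
                                   # or multiple of the number of rows
dim(m)        # 2 3
7 %in% m      # FALSE: the last value never made it into the matrix
```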
In [80]:
set.seed(15)

keepVars = c("posXY", "posX","posY", "orientation", "angle")

onlineCVSummary = reshapeSS(offlineRedo, keepVars = keepVars, 
                             sampleAngle = TRUE)

onlineFold = subset(onlineCVSummary, 
                     posXY %in% permuteLocs[ , 1])

offlineFold = subset(offlineSummary,
                     posXY %in% permuteLocs[ , -1])

estFold = predXY(newSignals = onlineFold[ , 6:11], 
                 newAngles = onlineFold[ , 4], 
                 offlineFold, numAngles = 3, k = 3)

actualFold = onlineFold[ , c("posX", "posY")]
calcError(estFold, actualFold)

K = 20
err = rep(0, K)

for (j in 1:v) {
  onlineFold = subset(onlineCVSummary, 
                      posXY %in% permuteLocs[ , j])
  offlineFold = subset(offlineSummary,
                       posXY %in% permuteLocs[ , -j])
  actualFold = onlineFold[ , c("posX", "posY")]
  
  for (k in 1:K) {
    estFold = predXY(newSignals = onlineFold[ , 6:11],
                     newAngles = onlineFold[ , 4], 
                     offlineFold, numAngles = 3, k = k)
    err[k] = err[k] + calcError(estFold, actualFold)
  }
}
65
In [81]:
plot(y = err, x = (1:K),  type = "l", lwd= 2,
     ylim = c(1300, 2100),
     xlab = "Number of Neighbors",
     ylab = "Sum of Square Errors")
title("SSE for k values 1 to 20 with both MACs")
rmseMin = min(err)
rmseMin
kMin = which(err == rmseMin)[1]
kMin
segments(x0 = 0, x1 = kMin, y0 = rmseMin, col = gray(0.4), 
         lty = 2, lwd = 2)
segments(x0 = kMin, x1 = kMin, y0 = 1280,  y1 = rmseMin, 
         col = grey(0.4), lty = 2, lwd = 2)
text(x = kMin - 2, y = rmseMin + 40, 
     label = as.character(round(rmseMin)), col = grey(0.4))
1458.59375
8
In [82]:
#Estimated Error

estXYk8 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 8)

calcError(estXYk8, actualXY)
235.195925

Error is even lower when including both MACs, 00:0f:a3:39:dd:cd and 00:0f:a3:39:e1:c0.

Weighted KNN with both MACs

A weighted kNN is designed to give more weight to the data points that are close by and less weight to the points that are farther away. Instead of averaging as in the previous models, this modified kNN applies weights to the X-Y positions of each of the k closest locations and takes the sum of those results. The predXY function is modified so that each weight is inversely proportional to the neighbor's distance from the test observation.
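The weighting scheme can be illustrated on a few hypothetical distances (not taken from the data): each of the k nearest neighbors receives weight $(1/d_i) / \sum_j (1/d_j)$, so the weights sum to 1 and the closest neighbor contributes most.

```r
# Sketch of the inverse-distance weights used below, for hypothetical
# distances to the k = 3 nearest training locations.
dists = c(2, 4, 8)
wgts = (1 / dists) / sum(1 / dists)
round(wgts, 3)  # 0.571 0.286 0.143 -- the closest neighbor dominates
sum(wgts)       # 1
```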

In [83]:
# changing the findNN function for weighted knn

findNN = function(newSignal, trainSubset) {
  diffs = apply(trainSubset[ , 4:9], 1, 
                function(x) x - newSignal)
  dists = apply(diffs, 2, function(x) sqrt(sum(x^2)) )
  closest = order(dists)
  return(list(trainSubset[closest, 1:3 ], dists[order(dists)])) #also now returns distances for finding nearest points
}
In [84]:
#modifying predXY function for weighted knn - Euclidean
#
set.seed(15)
predXY = function(newSignals, newAngles, trainData, 
                  numAngles = 1, k = 3){
  closeXY = list(length = nrow(newSignals))
  distance = list(length = nrow(newSignals))
  
  for (i in 1:nrow(newSignals)) {
    trainSS = selectTrain(newAngles[i], trainData, m = numAngles)
    wgtNN = findNN(newSignal = as.numeric(newSignals[i, ]), trainSS)
  
    closeXY[[i]] = wgtNN[[1]]
    distance[[i]] = wgtNN[[2]]
  }
  
    
  distWgt = list(length = length(distance))
  for (i in 1:length(distance)){
    distW = list(length = k)
    for (j in 1:k){
      distW[j] = (1/distance[[i]][j])/sum(1/distance[[i]][1:k])
    }
     
    distWgt[[i]] = distW
  }
  estXY1 = list(length = length(closeXY))
  
  for(i in 1:length(closeXY)){
    estXY1[[i]] = as.matrix(closeXY[[i]][1:k,2:3]) * unlist(distWgt[[i]])
  }
  
  estXY = lapply(estXY1,
                 function(x) apply(x, 2,
                                   function(x) sum(x)))
  
  
  estXY = do.call("rbind", estXY)
  return(estXY)
}
In [85]:
set.seed(15)

K = 20
err = rep(0, K)

for (j in 1:v) {
  onlineFold = subset(onlineCVSummary, 
                      posXY %in% permuteLocs[ , j])
  offlineFold = subset(offlineSummary,
                       posXY %in% permuteLocs[ , -j])
  actualFold = onlineFold[ , c("posX", "posY")]
  
  for (k in 1:K) {
    estFold = predXY(newSignals = onlineFold[ , 6:11],
                     newAngles = onlineFold[ , 4], 
                     offlineFold, numAngles = 3, k = k)
    err[k] = err[k] + calcError(estFold, actualFold)
  }
}
In [86]:
#plot 
plot(y = err, x = (1:K),  type = "l", lwd= 2,
     ylim = c(1200, 2100),
     xlab = "Number of Neighbors",
     ylab = "Sum of Square Errors")
title("SSE for k values 1 to 20, weighted with both MACs")
rmseMin = min(err)
rmseMin
kMin = which(err == rmseMin)[1]
kMin
segments(x0 = 0, x1 = kMin, y0 = rmseMin, col = gray(0.4), 
         lty = 2, lwd = 2)
segments(x0 = kMin, x1 = kMin, y0 = 1170,  y1 = rmseMin, 
         col = grey(0.4), lty = 2, lwd = 2)
text(x = kMin - 2, y = rmseMin + 40, 
     label = as.character(round(rmseMin)), col = grey(0.4))
1396.13815043264
9
In [87]:
#Estimated Error
set.seed(15)

estXYk9 = predXY(newSignals = onlineSummary[ , 6:11], 
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = kMin)

calcError(estXYk9, actualXY)
227.442300437037

The error estimate (227.44) is the lowest for the weighted model that includes both MAC addresses.