Author: Nicolas N
Last Updated: 06 January, 2006
Let's say we want to find the number of students at McGill who are 18 to 23 years old. To do this, we go through the list of all students and count the ones whose age that fall in that range.
Now, let's say we want to find the number of students aged between 20 and 25. We could repeat our previous approach, which is to go through the list of students and start counting again. However, if we have to do this query multiple times on the same data, it becomes clear that our brute force approach of counting over every student each time is not very efficient.
Imagine instead that we group all students with the same age inside a "bin". We then keep a count of number of students in each bin. Whenever we need to query the number of students that belong to an age range, we would only have to lookup the right set of bins and add the student counts.
Students Grouped by Age |
This is the basis for the locus method of solving range search problems. The search space is divided into regions or loci where the answer does not vary (e.g. student age). The query is then mapped to a point in the search space, and solving the query consists of finding the particular region (or "bin") where that point lies.
More generally, geometric range searching can be described as:
given a set of sample points in space, and
a standard geometric shape (e.g. a rectangle) translated in the space,
find a fast way to compute the number of the points contained by the geometric shape.
In this section, we provide some background on the concept of dominance. Later on, we will see how this information can help us answer range searching queries faster.
Definition: A point p(p_{1},...p_{k}) is said to dominate a point w(w_{1},...w_{k}) if and only if p_{i} >= w_{i} for all i's, in other words, p is greater or equal to w in all coordinates. Remark that the notion of dominance can be applied to points in any dimensions.
In 2 dimensions, the region dominated by a point p corresponds to the southwest quadrant of p (green area in the above diagram).
The number of sample points dominated by p, sometimes known as the rank of p, is denoted as Q(p). In the above diagram, Q(p) = 8.
Claim: The number N of points contained in a rectangle defined by 4 points p1, p2, p3, p4 can be determined by solving 4 dominance queries.
In particular, N = Q(p1) - Q(p2) - Q(p4) + Q(p3)
where Q(p_{i}) is the dominance value of the point p_{i}
Proof:
It is easy to see that any point inside the rectangle [p1 p2 p3 p4] must be dominated by the point p1.
dominance region of p1: Q(p1)
From that list of points, we must then remove those that are dominated by p4 and by p2 since they are not inside the query rectangle.
Q(p4) and Q(p2). Note the overlap region (lower left corner).
However by subtracting these 2 regions, we also remove points in the overlapped region twice. So we end up with a deficit in our point count.
Q(p1) - Q(p2) - Q(p3). Note "deficit" region on the lower left corner.
As it turns out, the region dominated by both p2 and p4 is the one dominated by p3.
Q(p3)
So, by adding the dominance value of p3 back to the equation, we get the correct count of points contained in the query rectangle.
N(p1 p2 p3 p4) = Q(p1) - Q(p2) - Q(p4) + Q(p3).
In this example, N = 8 - 5 - 3 + 2 = 2
When analyzing which range searching algorithm performs best, it is convenient to divide its performance in 3 categories, each with its own cost:
Query time. How long does it take to process a single query?
Storage (space). How much memory is needed to store the sample points in our data structure?
Pre-processing time. How long does it take to (better) arrange the data before searching?
As is typical in most algorithm, reducing the cost of one step
usually means increasing the cost of another. A trade-off often needs to
be made.
Specifically, if we plan to do multiple queries on the same data (repetitive-mode), then it is worthwhile to spend more time preprocessing the data and storing it in such a way that future search operations run faster.
On the other hand, if we're only interested in doing a single-shot query, then it is typically cheaper overall to just search without any data processing.
With that information in mind, the next section outlines various techniques to solve the range searching problem. We start with the simplest and go to the most complex.
No preprocessing is done on the data. To answer a point location query i.e. compute the dominance of a new point, we simply perform 2 linear searches (one for each coordinate) and identify the sample points with smaller or equal x and y coordinates.
Brute force searching: green lines identify points dominated by p in the x-coordinates and orange lines identify those dominated in y-coordinates. 3 sample points are dominated in both coordinates.
The storage requirement is O(N), essentially an array containing the coordinates of all points. No special arrangement of the data is required.
The query requires O(N) time because of the 2 linear searches.
O(N^{2}) partitions with constant dominance values
In this approach, we want to want to partition the search space into regions that have the same dominance values. To do this, we first note that dominance values can only change when we pass through a sample point in a given coordinate. So, given a set of sample points, we project perpendiculars from each sample point to the x axis and to the y axis. This divides the space into at most (N+1)^{2} rectangles all satisfying the above condition.
We then compute the dominance value of each partition by linearly checking how many sample points is dominated by the upper right corner of the partition rectangle. This naive counting approach takes O(N^{3}) time to process all O(N^{2}) partitions. The motivation for all this work is of course so that queries can be solved faster than with the brute force approach.
To answer a point location query, we only need to determine in which partition that point lies. Knowing the partition immediately gives us the dominance value of the point. To find the correct partition, we use a binary search for each coordinate of the point (note that the sample points were sorted to create partitions in the preprocessing step). Hence, in 2D, a location query can be answered by performing 2 binary searches i.e. O(log N) time.
A range search query (for finding the number of sample points contained inside a rectangle) can be answered by performing 4 location queries, one for each vertex of the rectangle; hence, O(log N). Using the inclusion-exclusion principle discussed above, we can then easily (O(1) time) compute the number of points bounded by the rectangle.
The most obvious drawback of the naive approach is the O(N^{3}) processing time required to compute dominance values of each partition.
The incremental approach reduces the time required for this step to O(N^{2}) with no increase in query time or storage space.
values are incremented when a sample point is encountered
The trick is to compute the dominance values of each partition incrementally rather than scanning through every sample point each time. The following pseudocode explains this step:
for each row y (starting from the lowest row) value_increment = 0 for each column x (starting the leftmost column) if (sample point is present at position (x,y)) then value_increment = value_increment + 1 dominance[x][y] = dominance[x][y-1] + value_increment
This is a bit similar to dynamic programming. In essence, we fill each row with values from the row just below it, except when we encounter a sample point somewhere along that row. In that case, the dominance value of the current partition (and all partitions to the right of it) is incremented by 1, since they dominate that point. The process is repeated until all partitions have been processed.
The applet in this tutorial illustrates this approach.
For completeness, we should mention that Shamos and Bentley also designed a divide and conquer method for solving this problem. Details of this approach can be found in the book “Computational Geometry” listed in the references section. Basically, it reduces the storage and preprocessing time to O(NlogN) but at the cost of increasing query time to worst-case O(log^{2} N) and using a much more complex data structure (k-d trees) to store the sample points.
A sketch of the preprocessing step follows:
Find the median x-coordinate of the sample points, then split the samples into 2 sets A and B about that point. Both A and B will contain N/2 sample points.
partition points about median x-coordinate: line L
Recursively split and find dominance values of all points in subsets A and B, separately.
Merge dominance results from subsets A and B. Note that when merging, the dominance values in A don't change since no point in A dominate any point in B (their x-values are smaller).
point z in A cannot dominate any point in set B
On the other hand, points in B might dominate points in A. So, while merging, dominance values in subset B will need to be adjusted.
Observe further that we know all points in B dominate all points in A in the x-coordinate (that's how the 2 sets were partitioned). So we only need to solve the dominance problem in the other dimension i.e. the y-coordinates. If we choose to sort all the points in their y-coordinates, then binary search can be used to identify all points in A that are dominated in just O(log N) time for each point in B. The entire preprocessing step runs in O(NlogN) time.
point z in B can dominate points in set A. We only need to check y-coordinates to determine dominance.
A similar divide-and-conquer strategy is used to answer point location queries. In the best case, the query point always lies in subset A, thus, can be solved recursively by looking at a subproblem half the size each time => O(log N).
In the worst case, the point always lies in subset B. So, not only do we need to find its rank in B, solvable in O(logN), but we also need to determine the number of points in A that it dominates each time. This part can be solved using binary search (as seen in the preprocessing step). Therefore, the worst case total time is O(log^{2}N).
For simplicity, we assumed in the previous section that the problem is being solved in 2 dimensions. Dominance values can however be defined for a point in any dimension. All of the algorithms discussed above can easily be generalized to work in any dimension. A summary of their complexities for the general case follows:
Algorithms |
Query |
Storage |
Pre-processing |
---|---|---|---|
Brute Force |
O(kN) |
O(kN) |
zero |
Naïve Counting |
O(k.log N) |
O(N^{k}) |
O(N^{k+1}) |
Incremental |
O(k.log N) |
O(N^{k}) |
O(N^{k}) |
Divide and Conquer |
O(log^{k} N) |
O(Nlog^{k-1} N) |
O(Nlog^{k-1} N) |
where
k is the dimension of the space
N is the number of sample points.
The applet illustrates the incremental approach of solving the range search problem.
Instructions:
Click on "Generate Sample" to generate a random number of sample points.
Click "Processs" to partition the search space and compute dominance values.
Click anywhere in the search space to determine the dominance value of a new point i.e. "location query".
Click on "Range Search" to determine the number of sample points bounded a box. To define the query box, click on a point in the search space, then click on a second point to define the diagonal of your box. Note that the partitions used in computing the point count are highlighted.
J. Bentley and M. Shamos, "A problem in multivariate statistics: algorithms, data structure and applications", 15th Allerton conference on Communication, Control and Computing, 1977.
F. Preparata and M. Shamos, "Chapter 2: Geometric Searching", Computational Geometry, 1985