Background
Multidimensional Data
- Flat relational tables
- Multimedia feature vectors
- Data warehouse data
- Spatial data
- Text documents
Attribute Types
- Attributes of multidimensional tuples may be of different types
- Ordinal (e.g., age, salary)
- Nominal categorical values (e.g., color, religion)
- Binary (e.g., gender, owns_property)
- Basic queries: range, NN, similarity
Basic Queries
- (Range) selection query
- Returns the records that satisfy a (multidimensional) range predicate
- Example:
- Return the employees with age between 45 and 50 and salary above $100,000
- Distance (similarity) query
- Returns the records that are within a distance from a reference record.
- Example:
- Find images whose feature vectors are within Euclidean distance ε from the feature vector of a given image
- Nearest neighbor (similarity) query
- Replaces the distance bound with a ranking predicate: return the k records closest to the reference (all three query types are sketched below)
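A minimal sketch (in Python) of the three basic query types over toy data; the employee records, feature vectors, query point, and ε are illustrative, not taken from the slides.

import math

employees = [
    {"name": "Ann",  "age": 47, "salary": 120000},
    {"name": "Bob",  "age": 52, "salary": 95000},
    {"name": "Carl", "age": 45, "salary": 101000},
]

# Range selection: age in [45, 50] AND salary > 100,000
range_result = [e for e in employees
                if 45 <= e["age"] <= 50 and e["salary"] > 100000]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance (similarity) query: feature vectors within eps of a query vector
vectors = {"img1": (0.1, 0.9), "img2": (0.4, 0.4), "img3": (0.2, 0.8)}
query, eps = (0.15, 0.85), 0.1
within_eps = [name for name, v in vectors.items() if euclidean(query, v) <= eps]

# Nearest-neighbor query: the distance bound becomes a ranking predicate (k closest)
k = 2
knn = sorted(vectors, key=lambda name: euclidean(query, vectors[name]))[:k]

print(range_result, within_eps, knn)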
Top-k Search Methods
- Rank aggregation
- Index-based methods
Top-k Query
- Given a set of objects (e.g., relational tuples),
- Returns the k objects with the highest combined score, based on an aggregate function f.
- Example:
- Relational table containing information about restaurants, with attributes (e.g., price, quality, location)
- f: sum(-price, quality, -dist(location,my_hotel))
- attribute value ranges are usually normalized
- E.g., all values in a (0,1) range
- otherwise attributes with larger value ranges would be favored in f (see the sketch below)
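A minimal sketch (in Python) of this restaurant example with min-max normalization to (0,1) and f = sum(-price, quality, -dist); the rows, values, and the choice of min-max normalization are illustrative.

restaurants = [
    {"name": "A", "price": 30, "quality": 8, "dist": 1.2},
    {"name": "B", "price": 80, "quality": 9, "dist": 0.3},
    {"name": "C", "price": 20, "quality": 6, "dist": 2.5},
]

def normalize(rows, attr):
    # min-max normalization, so no attribute dominates f because of its range
    lo, hi = min(r[attr] for r in rows), max(r[attr] for r in rows)
    for r in rows:
        r[attr] = (r[attr] - lo) / (hi - lo) if hi > lo else 0.0

for attr in ("price", "quality", "dist"):
    normalize(restaurants, attr)

def f(r):
    # f = sum(-price, quality, -dist): cheap, good, and close restaurants score highest
    return -r["price"] + r["quality"] - r["dist"]

k = 2
top_k = sorted(restaurants, key=f, reverse=True)[:k]
print([(r["name"], round(f(r), 2)) for r in top_k])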
Top-k Query Variants
- Applied to a single table, or to ranked lists of tuples ordered by individual attributes
- Ranked inputs on the same or different servers (centralized or distributed data)
- Standalone query, or operator in a more complex query plan
- Incremental retrieval of objects with the highest scores (k is not predefined)
- Top-k joins, e.g.:
SELECT h.id, s.id
FROM House h, School s
WHERE h.location = s.location
ORDER BY h.price + 10 * s.tuition
LIMIT 5
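A minimal sketch (in Python) of what this query computes, evaluated naively: materialize the join on location, rank pairs by h.price + 10 * s.tuition (ORDER BY is ascending by default), and keep the 5 best pairs; the table contents are illustrative.

import heapq

houses  = [{"id": 1, "location": "X", "price": 500},
           {"id": 2, "location": "Y", "price": 300},
           {"id": 3, "location": "X", "price": 450}]
schools = [{"id": 7, "location": "X", "tuition": 20},
           {"id": 8, "location": "Y", "tuition": 45}]

# join on location and compute the combined score of each pair
pairs = [(h["price"] + 10 * s["tuition"], h["id"], s["id"])
         for h in houses for s in schools
         if h["location"] == s["location"]]

for score, h_id, s_id in heapq.nsmallest(5, pairs):   # LIMIT 5, lowest score first
    print(h_id, s_id, score)

A top-k join algorithm would typically consume the inputs ordered by price and tuition and stop early, instead of materializing the whole join as above.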
- Probabilistic/approximate top-k retrieval
- Random and/or sorted accesses at ranked inputs
Top-k Query Evaluation
- Most solutions assume distributive, monotone aggregate functions (e.g., f = sum; see the check below)
- distributive: f(x,y,z,w)= f(f(x,y),f(z,w))
- e.g., A+B+C+D = (A+B) + (C+D)
- monotone: if x ≤ y and z ≤ w, then f(x,z) ≤ f(y,w)
- e.g., increasing any atomic score cannot decrease A+B+C+D
- Solutions based on 1-D ordering and merging sorted lists (rank aggregation)
- Solutions based on multidimensional indexing
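A small illustrative check (in Python) of both properties for f = sum; the values are chosen to be exactly representable in binary floating point so the equality test is safe.

def f(*xs):
    return sum(xs)

# distributive: the aggregate can be computed from partial aggregates
A, B, C, D = 0.25, 0.75, 0.5, 1.0
assert f(A, B, C, D) == f(f(A, B), f(C, D))

# monotone: raising any atomic score cannot lower the aggregate
x, y, z, w = 0.1, 0.4, 0.2, 0.5          # x <= y and z <= w
assert f(x, z) <= f(y, w)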
Rank Aggregation
- Solutions based on 1-D ordering and merging sorted lists (rank aggregation)
- Assume that there is a total ranking of the objects for each attribute that can be used in top-k queries
- These sorted inputs can be accessed sequentially and/or by random accesses
Advantages and Drawbacks
- Advantages:
- can be applied on any subset of inputs (arbitrary subspace)
- appropriate for distributed data
- appropriate for top-k joins
- easy to understand and implement
- Drawbacks:
- slower than index-based methods
- require inputs to be sorted
TA: Threshold Algorithm
Introduction
- Iteratively retrieves objects and their atomic scores from the ranked inputs in a round-robin fashion.
- For each encountered object x, perform random accesses to the inputs where x has not been seen.
- Maintain top-k objects seen so far.
- The threshold T = f(l_1, ..., l_m) is the score obtained by applying the aggregation function to the last atomic scores seen at each input. If the score of the current k-th object is no smaller than T, terminate (see the sketch below).
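A minimal sketch of TA (in Python), assuming each ranked input is an in-memory list of (object, atomic_score) pairs sorted by descending score, with random accesses simulated by per-input dictionaries; the data at the end is illustrative and not the lists behind the example that follows.

import heapq

def threshold_algorithm(ranked_inputs, f, k):
    # ranked_inputs: one list of (object, atomic_score) pairs per input,
    # sorted by descending score; f aggregates a list of scores, e.g. f = sum
    lookups = [dict(inp) for inp in ranked_inputs]   # used for random accesses
    scores = {}                                      # object -> aggregate score f(x)
    top_k = []
    for depth in range(max(len(inp) for inp in ranked_inputs)):
        last_seen = []                               # l_1, ..., l_m of this round
        for inp in ranked_inputs:
            if depth >= len(inp):
                continue
            obj, score = inp[depth]                  # sorted access (round-robin)
            last_seen.append(score)
            if obj not in scores:
                # random accesses: fetch obj's atomic score at every input
                # (0.0 assumed if obj is absent from some input)
                scores[obj] = f([l.get(obj, 0.0) for l in lookups])
        top_k = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        threshold = f(last_seen)                     # T = f(l_1, ..., l_m)
        if len(top_k) == k and top_k[-1][1] >= threshold:
            break                                    # k-th score >= T: safe to stop
    return top_k

# Illustrative inputs (not the lists behind the slide example):
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.6)]
L2 = [("c", 0.9), ("b", 0.8), ("a", 0.6)]
print(threshold_algorithm([L1, L2], sum, k=1))       # -> [('b', 1.6)]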
Example of TA(k=1,f=sum)
- STEP 1
- top-1 is c, with score 2.0
- T=sum(0.9,0.9,0.9)=2.7
- T>top-1, we proceed to another round of accesses
- STEP 2
- top-1 is b, with score 2.2
- T=sum(0.8,0.8,0.9)=2.5
- T>top-1, we proceed to another round of accesses
- STEP 3
- top-1 is b, with score 2.2
- T=sum(0.6,0.6,0.8)=2.0
- T≤top-1, terminate and output (b,2.2)
Properties of TA
- Used as a standard module for merging ranked lists in many applications
- Usually finds the result quickly
- Depends on random accesses, which can be expensive
- random accesses are impossible in some cases
- e.g., an API that returns objects incrementally by ranking score but does not provide the score of a given object on request
NRA: No Random Accesses
Introduction
- Iteratively retrieves objects and their atomic scores from the ranked inputs in a round-robin fashion.
- For each object x seen so far at any input maintain:
- f^ub(x): upper bound on x's aggregate score f(x), using the last score seen at each input where x has not been seen yet
- f^lb(x): lower bound on x's aggregate score f(x), using the worst possible score (e.g., 0) at each input where x has not been seen yet
- W_k = the k objects with the largest f^lb
- If the smallest f^lb in W_k is at least the largest f^ub of any object not in W_k (including entirely unseen objects), terminate and report W_k as the top-k result (see the sketch below).
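A minimal sketch of NRA (in Python) for f = sum, assuming each ranked input is a list of (object, atomic_score) pairs sorted by descending score and that atomic scores lie in [0, 1], so 0 is the worst case for an unseen score; the data at the end is illustrative.

def nra(ranked_inputs, k):
    m = len(ranked_inputs)
    seen = {}                                    # object -> {input index: atomic score}
    last = [1.0] * m                             # last atomic score seen at each input

    def f_lb(obj):                               # lower bound: unseen scores taken as 0
        return sum(seen[obj].values())

    def f_ub(obj):                               # upper bound: unseen scores at most last[i]
        return sum(seen[obj].get(i, last[i]) for i in range(m))

    w_k = []
    for depth in range(max(len(inp) for inp in ranked_inputs)):
        for i, inp in enumerate(ranked_inputs):  # one round of sorted accesses
            if depth < len(inp):
                obj, score = inp[depth]
                seen.setdefault(obj, {})[i] = score
                last[i] = score
        ranked = sorted(seen, key=f_lb, reverse=True)
        w_k, rest = ranked[:k], ranked[k:]
        if len(w_k) < k:
            continue
        # any unseen object scores at most sum(last); seen objects outside W_k at most f_ub
        best_outside = max([f_ub(o) for o in rest] + [sum(last)])
        if min(f_lb(o) for o in w_k) >= best_outside:
            break                                # W_k is guaranteed to be the top-k set
    return [(o, f_lb(o)) for o in w_k]           # f_lb may still be a partial score

# Illustrative inputs:
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.6)]
L2 = [("c", 0.9), ("b", 0.8), ("a", 0.6)]
print(nra([L1, L2], k=1))                        # -> [('b', 1.6)]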
Example of NRA(k=1,f=sum)
- Steps 1-4: successive rounds of sorted accesses, updating f^lb, f^ub, and W_k until the termination condition holds
NRA Properties
- More generic than TA, since it does not depend on random accesses
- Can be cheaper than TA, if random accesses are very expensive
- NRA accesses objects sequentially from all inputs and updates the upper bounds for all objects seen so far unconditionally.
- Cost: O(n) per sorted access, since the bounds of all objects seen so far (an expected O(n) distinct objects) may need to be updated
- No input list is pruned until the algorithm terminates
LARA: LAttice-based Rank Aggregation
- LARA: An efficient NRA implementation
- Based on 3 observations about the top-k candidates
- Operates differently in the two (growing, shrinking) phases
- Takes its name from the lattice used in the shrinking phase
- Extendable to various top-k query variants