In computational geometry and approximation algorithms, a coreset is a small, possibly weighted subset of an input point set that approximately preserves the value of a specified optimization problem. Solving the problem on the coreset yields a solution whose cost is provably close to the optimal solution for the full dataset. Coresets are widely used in geometric optimization, cluster analysis, data streams, and large-scale machine learning to reduce computational complexity while maintaining theoretical guarantees.
Many geometric optimization problems admit coresets of size bounded by a function of the approximation parameter õ and the dimension, but independent of the input size. When such a coreset can be constructed in linear or near-linear time, it yields a polynomial-time approximation scheme (PTAS) or efficient approximation algorithm.
The concept of coresets emerged in the late 1990s and early 2000s within computational geometry, as part of a broader effort to develop approximation schemes for high-dimensional geometric problems. Early work connected coresets to õ-approximations and õ-nets in range spaces and VC dimension theory. Subsequent research extended the framework to clustering, streaming models, and distributed computation. Over time, coresets became a central tool in large-scale data analysis, particularly for clustering and regression problems, where exact computation on massive datasets is computationally infeasible.
Let P be a finite set of points and let f(P, x) denote the cost of a candidate solution x to an optimization problem defined on P. For example, in k-means clustering, x may represent a set of k centers and f(P, x) the sum of squared distances from points in P to their nearest center.
A (strong) õ-coreset for P with respect to f is a (possibly weighted) subset S â P such that for all candidate solutions x,
where õ > 0 is a user-defined approximation parameter.
In many constructions, S is equipped with weights w(p) so that
where c(p, x) is the contribution of point p to the cost.
Some literature distinguishes between:
Coresets are typically constructed using one or more of the following techniques:
For many problems, the size of the coreset is O(g(õ, d)), where d is the dimension and the bound does not depend on the input size n.
Coresets are widely used in clustering problems such as k-means clustering, k-median, and metric k-center . For example, in k-means clustering in Euclidean space, one can construct a coreset of size O(k / õò) (up to logarithmic factors in some settings), independent of n. Running an exact or heuristic clustering algorithm on the coreset yields a (1 + õ)-approximation for the original dataset.
This enables scalable clustering in large datasets and forms the basis of several practical machine learning systems.
Coresets have been developed for problems including:
In low-dimensional settings, coresets often yield polynomial-time approximation schemes.
In regression problems such as least-squares fitting, coresets provide smaller weighted datasets that preserve the objective value. They are also used in:
More recently, coresets have been explored for dataset summarization and acceleration of training in large-scale machine learning pipelines.
In streaming models, data points arrive sequentially and storage is limited. Merge-and-reduce techniques maintain a small coreset whose size depends only on õ and problem parameters. Similarly, in distributed systems, composable coresets allow each machine to compute a local coreset, which can then be combined centrally while preserving approximation guarantees.
Coresets are related to: