Coreset

In computational geometry and approximation algorithms, a coreset is a small, possibly weighted subset of an input point set that approximately preserves the value of a specified optimization problem. Solving the problem on the coreset yields a solution whose cost is provably close to the optimal solution for the full dataset. Coresets are widely used in geometric optimization, cluster analysis, data streams, and large-scale machine learning to reduce computational complexity while maintaining theoretical guarantees.

Many geometric optimization problems admit coresets of size bounded by a function of the approximation parameter ÃÂµ and the dimension, but independent of the input size. When such a coreset can be constructed in linear or near-linear time, it yields a polynomial-time approximation scheme (PTAS) or efficient approximation algorithm.

The concept of coresets emerged in the late 1990s and early 2000s within computational geometry, as part of a broader effort to develop approximation schemes for high-dimensional geometric problems. Early work connected coresets to ÃÂµ-approximations and ÃÂµ-nets in range spaces and VC dimension theory. Subsequent research extended the framework to clustering, streaming models, and distributed computation. Over time, coresets became a central tool in large-scale data analysis, particularly for clustering and regression problems, where exact computation on massive datasets is computationally infeasible.

Definition

Let P be a finite set of points and let f(P, x) denote the cost of a candidate solution x to an optimization problem defined on P. For example, in k-means clustering, x may represent a set of k centers and f(P, x) the sum of squared distances from points in P to their nearest center.

A (strong) ÃÂµ-coreset for P with respect to f is a (possibly weighted) subset S Ã¢ÂÂ P such that for all candidate solutions x,

where ÃÂµ > 0 is a user-defined approximation parameter.

In many constructions, S is equipped with weights w(p) so that

where c(p, x) is the contribution of point p to the cost.

Some literature distinguishes between:

Strong coresets: The approximation guarantee holds uniformly for all candidate solutions x.
Weak coresets: The guarantee holds only for solutions near the optimum.

Construction techniques

Coresets are typically constructed using one or more of the following techniques:

Importance sampling: Points are sampled with probability proportional to their sensitivity (their maximum relative influence on the objective function).
Random sampling and ÃÂµ-approximations: Techniques from VC theory are used to guarantee uniform convergence.
Geometric partitioning: Space is partitioned and representative points are selected from each region.
Merge-and-reduce frameworks: Used in streaming and distributed settings to maintain coresets over time.

For many problems, the size of the coreset is O(g(ÃÂµ, d)), where d is the dimension and the bound does not depend on the input size n.

Applications

Clustering

Coresets are widely used in clustering problems such as k-means clustering, k-median, and metric k-center . For example, in k-means clustering in Euclidean space, one can construct a coreset of size O(k / ÃÂµÃÂ²) (up to logarithmic factors in some settings), independent of n. Running an exact or heuristic clustering algorithm on the coreset yields a (1 + ÃÂµ)-approximation for the original dataset.

This enables scalable clustering in large datasets and forms the basis of several practical machine learning systems.

Geometric optimization

Coresets have been developed for problems including:

Minimum enclosing ball
Facility location
Projective clustering
Shape fitting

In low-dimensional settings, coresets often yield polynomial-time approximation schemes.

Regression and machine learning

In regression problems such as least-squares fitting, coresets provide smaller weighted datasets that preserve the objective value. They are also used in:

Support vector machines
Subspace approximation
Hyperparameter optimization

More recently, coresets have been explored for dataset summarization and acceleration of training in large-scale machine learning pipelines.

Streaming and distributed computation

In streaming models, data points arrive sequentially and storage is limited. Merge-and-reduce techniques maintain a small coreset whose size depends only on ÃÂµ and problem parameters. Similarly, in distributed systems, composable coresets allow each machine to compute a local coreset, which can then be combined centrally while preserving approximation guarantees.

Related concepts

Coresets are related to:

ÃÂµ-approximations and ÃÂµ-nets in range spaces
Randomized sketching techniques
Dimensionality reduction
Sublinear and streaming algorithms