Mean

results.mean.MeanResults(mean, count) dataclass

count instance-attribute

mean instance-attribute

results.mean.OnlineMeanResults()

Bases: ResultsHandler[MeanResults]

OnlineMeanResults provides a numerically stable, online (incremental) algorithm for computing the arithmetic mean of large, streaming, or distributed datasets represented as NumPy arrays. In many real-world applications, such as distributed computing or real-time data processing, data is processed in batches, with each batch yielding a partial mean and its corresponding count. This class merges these partial results into a global mean without storing the raw data and without accumulating a single large running sum, which helps avoid numerical overflow and precision loss.

Suppose the overall dataset is divided into \(k\) batches. For each batch \(i\) (where \(1 \leq i \leq k\)), let:

  • \(m_i\) be the partial mean computed over \(n_i\) data points.
  • \(M_i\) be the global mean computed after processing \(i\) batches.
  • \(N_i = n_1 + n_2 + \ldots + n_i\) be the cumulative count after \(i\) batches.

The arithmetic mean of all data points is defined as:

\[ M_\text{total} = \frac{n_1 m_1 + n_2 m_2 + \ldots + n_k m_k}{n_1 + n_2 + \ldots + n_k} \]
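
As a quick check of the definition with made-up numbers (not taken from any real dataset), take two batches with \(n_1 = 2\), \(m_1 = 3\) and \(n_2 = 6\), \(m_2 = 7\):

\[ M_\text{total} = \frac{2 \cdot 3 + 6 \cdot 7}{2 + 6} = \frac{48}{8} = 6 \]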

Rather than computing \(M_\text{total}\) from scratch after processing all data, the class uses an iterative update rule. When merging a new partial result \((m_\text{partial}, n_\text{partial})\) with the current global mean \(M_\text{old}\) (with count \(n_\text{old}\)), the updated mean is given by:

\[ M_\text{new} = M_\text{old} + \left( m_\text{partial} - M_\text{old} \right) \cdot \frac{n_\text{partial}}{n_\text{old} + n_\text{partial}} \]

This update is mathematically equivalent to the weighted average:

\[ M_\text{new} = \frac{n_\text{old} M_\text{old} + n_\text{partial} m_\text{partial}}{n_\text{old} + n_\text{partial}} \]

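but is rearranged to improve numerical stability. By working with the difference \(m_\text{partial} - M_\text{old}\) and scaling it by the relative weight \(n_\text{partial} / (n_\text{old} + n_\text{partial})\), the update avoids forming the large products \(n_\text{old} M_\text{old}\) and \(n_\text{partial} m_\text{partial}\), which reduces the round-off error that can build up when summing large numbers or when processing many batches sequentially.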
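
The update rule can be sketched as a small standalone function. This is an illustration only, not the class's actual code: the name `merge_mean` and its argument order are assumptions made for the example. It is written with plain arithmetic, so it accepts Python floats and should also apply elementwise to NumPy arrays of matching shape.

```python
def merge_mean(m_old, n_old, m_partial, n_partial):
    """Merge a partial mean into a running mean with the incremental rule.

    Hypothetical helper for illustration; not part of the documented API.
    """
    if n_old < 0 or n_partial < 0:
        raise ValueError("counts must be non-negative")
    n_new = n_old + n_partial
    if n_new == 0:
        raise ValueError("cannot merge two empty results")
    # Shift the old mean toward the partial mean by the partial's relative weight.
    m_new = m_old + (m_partial - m_old) * (n_partial / n_new)
    return m_new, n_new


# Made-up batches: mean 3.0 over 2 points, then mean 7.0 over 6 points.
mean, count = merge_mean(3.0, 2, 7.0, 6)

# The same quantity computed with the weighted-average form, for comparison.
weighted = (2 * 3.0 + 6 * 7.0) / (2 + 6)
```

Both forms give the same value here; they differ only in how intermediate quantities are formed, which matters once counts and values become large.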

The handler starts with no accumulated data. The global mean (global_mean) is initially set to None, and it will be defined by the first partial result received. The total number of observations (total_count) is initialized to zero.

global_mean = None instance-attribute

The current global mean of all processed observations.

total_count = 0 instance-attribute

The total number of observations processed.

add_result(result, *args, **kwargs)

Processes one or more batches of partial results to update the global mean.

PARAMETER DESCRIPTION
batch_results

Results after running Task.

batch_index

An optional index identifier for the batch (for interface consistency, not used in calculations).

get()

Retrieves the final computed global mean along with the total number of observations.

RETURNS DESCRIPTION
MeanResults

A MeanResults instance with the following fields:

  • mean: A NumPy array representing the computed global mean.
  • count: An integer representing the total number of observations processed.
RAISES DESCRIPTION
ValueError

If no data has been processed (i.e., global_mean is None or total_count is zero).
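
The lifecycle described on this page (empty start, first partial result defines the mean, later results are merged, and retrieval fails when nothing was processed) can be sketched with a minimal stand-in. `RunningMean` and `MeanSnapshot` are hypothetical names invented for this example; they mirror the documented behaviour but are not the library's classes, and skipping empty batches is an assumption. Plain floats are used to keep the sketch dependency-free.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MeanSnapshot:
    # Hypothetical stand-in for a (mean, count) result pair.
    mean: float
    count: int


class RunningMean:
    """Hypothetical stand-in that mirrors the documented behaviour only."""

    def __init__(self) -> None:
        self.global_mean: Optional[float] = None  # undefined until first result
        self.total_count = 0

    def add(self, partial: MeanSnapshot) -> None:
        if partial.count == 0:
            return  # assumption: empty batches are ignored
        if self.global_mean is None:
            # The first partial result defines the global mean.
            self.global_mean = partial.mean
            self.total_count = partial.count
            return
        new_count = self.total_count + partial.count
        weight = partial.count / new_count
        self.global_mean = self.global_mean + (partial.mean - self.global_mean) * weight
        self.total_count = new_count

    def get(self) -> MeanSnapshot:
        if self.global_mean is None or self.total_count == 0:
            raise ValueError("no data has been processed")
        return MeanSnapshot(self.global_mean, self.total_count)


handler = RunningMean()

# Retrieving before any data has been added raises ValueError.
try:
    handler.get()
    raised = False
except ValueError:
    raised = True

# Made-up partial results.
handler.add(MeanSnapshot(mean=3.0, count=2))
handler.add(MeanSnapshot(mean=7.0, count=6))
result = handler.get()
```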