Skip to content

Mean

OnlineMeanResultHandler()

Bases: BaseResultHandler

OnlineMeanResultHandler provides a numerically stable, online (incremental) algorithm to compute the arithmetic mean of large, streaming, or distributed datasets represented as NumPy arrays. In many real-world applications—such a distributed computing or real-time data processing—data is processed in chunks, with each chunk yielding a partial mean and its corresponding count. This class merges these partial results into a global mean without needing to store all the raw data, thus avoiding issues such as numerical overflow and precision loss.

Suppose the overall dataset is divided into k chunks. For each chunk \(i\) (where \(1 \leq i \leq k\)), let:

  • \(m_i\) be the partial mean computed over \(n_i\) data points.
  • \(M_i\) be the global mean computed after processing \(i\) chunks.
  • \(N_i = n_1 + n_2 + ... + n_i\) be the cumulative count after \(i\) chunks.

The arithmetic mean of all data points is defined as:

\[ M_\text{total} = \frac{n_1 m_1 + n_2 m_2 + \ldots + n_k m_k}{n_1 + n_2 + \ldots + n_k} \]

Rather than computing \(M_{total}\) from scratch after processing all data, the class uses an iterative update rule. When merging a new partial result (m_partial, n_partial) with the current global mean M_old (with count n_old), the updated mean is given by:

\[ M_\text{new} = M_\text{old} + \left( m_\text{partial} - M_\text{old} \right) \cdot \frac{n_\text{partial}}{n_\text{old} + n_\text{partial}} \]

This update is mathematically equivalent to the weighted average:

\[ M_\text{new} = \frac{n_\text{old} M_\text{old} + n_\text{partial} m_\text{partial}}{n_\text{old} + n_\text{partial}} \]

but is rearranged to enhance numerical stability. By focusing on the difference (m_partial - M_old) and scaling it by the relative weight n_partial / (n_old + n_partial), the algorithm minimizes the round-off errors that can occur when summing large numbers or when processing many chunks sequentially.

The handler starts with no accumulated data. The global mean (global_mean) is initially set to None, and it will be defined by the first partial result received. The total number of observations (total_count) is initialized to zero.

global_mean = None

The current global mean of all processed observations.

total_count = 0

The total number of observations processed.

add_chunk(chunk_results, chunk_index=None, *args, **kwargs)

Processes one or more chunks of partial results to update the global mean.

PARAMETER DESCRIPTION
chunk_results

Either a single tuple (partial_mean, count) or a list of such tuples. Each partial result must be a tuple consisting of: - A NumPy array representing the partial mean of a data chunk. - An integer representing the count of observations in that chunk.

TYPE: list[tuple[NDArray[float64], int]] | tuple[NDArray[float64], int]

chunk_index

An optional index identifier for the chunk (for interface consistency, not used in calculations).

TYPE: int | None DEFAULT: None

finalize(saver)

Finalizes the accumulation process and persists the final global mean if a saver is provided.

This method should be called when no more data is expected. It ensures that the final computed mean and the total observation count are saved.

PARAMETER DESCRIPTION
saver

An instance of a Saver to save the final result, or None.

TYPE: Saver | None

get_results()

Retrieves the final computed global mean along with the total number of observations.

RETURNS DESCRIPTION
dict[str, NDArray[float64] | int]

A dictionary with the following keys:

  • "mean": A NumPy array representing the computed global mean.
  • "n": An integer representing the total number of observations processed.
RAISES DESCRIPTION
ValueError

If no data has been processed (i.e., global_mean is None or total_count is zero).

periodic_save_if_needed(saver, save_interval)

Persists the current global mean at periodic intervals if a saver is provided.

This method is useful in long-running or distributed applications where saving intermediate results is necessary for fault tolerance. It checks if the total number of observations processed meets or exceeds the save_interval and, if so, invokes the save method of the provided saver.

PARAMETER DESCRIPTION
saver

An instance of a Saver, which has a save method to persist data, or None.

TYPE: Saver | None

save_interval

The threshold for the total number of observations to trigger a save.

TYPE: int