Mean
OnlineMeanResultHandler()
¶
Bases: BaseResultHandler
OnlineMeanResultHandler
provides a numerically stable, online (incremental)
algorithm to compute the arithmetic mean of large, streaming, or distributed
datasets represented as NumPy arrays. In many real-world applications—such a
distributed computing or real-time data processing—data is processed
in chunks, with each chunk yielding a partial mean and its corresponding count.
This class merges these partial results into a global mean without needing to
store all the raw data, thus avoiding issues such as numerical overflow and
precision loss.
Suppose the overall dataset is divided into k chunks. For each chunk \(i\) (where \(1 \leq i \leq k\)), let:
- \(m_i\) be the partial mean computed over \(n_i\) data points.
- \(M_i\) be the global mean computed after processing \(i\) chunks.
- \(N_i = n_1 + n_2 + ... + n_i\) be the cumulative count after \(i\) chunks.
The arithmetic mean of all data points is defined as:
Rather than computing \(M_{total}\) from scratch after processing all data,
the class uses an iterative update rule. When merging a new partial result
(m_partial, n_partial)
with the current global mean M_old
(with count
n_old
), the updated mean is given by:
This update is mathematically equivalent to the weighted average:
but is rearranged to enhance numerical stability. By focusing on the
difference (m_partial - M_old)
and scaling it by the relative weight
n_partial / (n_old + n_partial)
, the algorithm minimizes the round-off
errors that can occur when summing large numbers or when processing many
chunks sequentially.
The handler starts with no accumulated data. The global mean (global_mean
) is initially
set to None, and it will be defined by the first partial result received. The total number
of observations (total_count
) is initialized to zero.
global_mean = None
¶
The current global mean of all processed observations.
total_count = 0
¶
The total number of observations processed.
add_chunk(chunk_results, chunk_index=None, *args, **kwargs)
¶
Processes one or more chunks of partial results to update the global mean.
PARAMETER | DESCRIPTION |
---|---|
chunk_results
|
Either a single tuple (partial_mean, count) or a list of such tuples. Each partial result must be a tuple consisting of: - A NumPy array representing the partial mean of a data chunk. - An integer representing the count of observations in that chunk.
TYPE:
|
chunk_index
|
An optional index identifier for the chunk (for interface consistency, not used in calculations).
TYPE:
|
finalize(saver)
¶
Finalizes the accumulation process and persists the final global mean if a saver is provided.
This method should be called when no more data is expected. It ensures that the final computed mean and the total observation count are saved.
PARAMETER | DESCRIPTION |
---|---|
saver
|
An instance of a Saver to save the final result, or None.
TYPE:
|
get_results()
¶
Retrieves the final computed global mean along with the total number of observations.
RETURNS | DESCRIPTION |
---|---|
dict[str, NDArray[float64] | int]
|
A dictionary with the following keys:
|
RAISES | DESCRIPTION |
---|---|
ValueError
|
If no data has been processed (i.e., |
periodic_save_if_needed(saver, save_interval)
¶
Persists the current global mean at periodic intervals if a saver is provided.
This method is useful in long-running or distributed applications where saving
intermediate results is necessary for fault tolerance. It checks if the total
number of observations processed meets or exceeds the save_interval
and,
if so, invokes the save method of the provided saver.
PARAMETER | DESCRIPTION |
---|---|
saver
|
An instance of a Saver, which has a save method to persist data, or None.
TYPE:
|
save_interval
|
The threshold for the total number of observations to trigger a save.
TYPE:
|