Dataflow Aggregations
This article describes the predefined aggregation functions and grouping facilities available
with the AggregateTransform
workers. Also see the
AggregateCsvCreateInsertTable
sample. Custom dataflow aggregations are instead described in
Custom Dataflow Aggregations.
Note
- By default, .NET Framework maximum array size is 2GB, which in turn with a 64-bit application limits
the number of unique groupings to a maximum of 47.9 million, and the
CountDistinct
number of distinct values per column to a maximum of just over 89 million. You can remove these limits by enabling support for larger arrays, as described in <gcAllowVeryLargeObjects> Element.- .NET Core and .NET 5+ has a similar setting, but already defaults to support >2GB: COMPlus_gcAllowVeryLargeObjects
- Column name matching is ordinal case insensitive, but a case sensitive match takes precedence over a case insensitive match.
Predefined Column Aggregations
The following predefined column aggregations, described in IAggregationCommand,
are available: Average
, Count
, CountDistinct
, CountRows
, First
, Last
, Max
, Min
and Sum
.
Use AggregateTransform<TInputAccumulateOutput>
with
or
without
grouping when input and output rows have the same type.
Use AggregateTransform<TInput, TAccumulateOutput>
with
or without
grouping if types are different. Here's an example that both groups and aggregates rows:
// using actionETL;
// using System;
// using System.Collections.Generic;
private class In
{
public int Category;
public int Metric;
}
private readonly In[] _sampleData = new In[]
{
new In { Category = 1, Metric = 2 }
, new In { Category = 1, Metric = 2 }
, new In { Category = 1, Metric = 5 }
, new In { Category = 2, Metric = 10 }
};
private class AccOut
{
public int Category;
public double Average;
public long Count;
public long CountDistinct;
public long CountRows;
public int First;
public int Last;
public int Max;
public int Min;
public double Sum;
}
public WorkerSystem RunExample()
{
var workerSystem = new WorkerSystem()
.Root(ws =>
{
var source = new EnumerableSource<In>(ws, "Source", _sampleData)
.Output.Link.AggregateTransform2<In, AccOut>(
"Transform"
, ac => ac
.Average(nameof(In.Metric), nameof(AccOut.Average))
.Count(nameof(In.Metric), nameof(AccOut.Count))
.CountDistinct(nameof(In.Metric), nameof(AccOut.CountDistinct))
.CountRows(nameof(AccOut.CountRows))
.First(nameof(In.Metric), nameof(AccOut.First))
.Last(nameof(In.Metric), nameof(AccOut.Last))
.Max(nameof(In.Metric), nameof(AccOut.Max))
.Min(nameof(In.Metric), nameof(AccOut.Min))
.Sum(nameof(In.Metric), nameof(AccOut.Sum))
, g => g.Name(nameof(In.Category)) // Group by column [Category]
)
// Receives two rows:
// { Category=1, Average=3.0, Count=3, CountDistinct=2, CountRows=3
// , First=2, Last=5, Max=5, Min=2, Sum=9 }
// { Category=2, Average=10.0, Count=1, CountDistinct=1, CountRows=1
// , First=10, Last=10, Max=10, Min=10, Sum=10 }
.Output.Link.CollectionTarget();
});
workerSystem.Start().ThrowOnFailure();
return workerSystem;
}
Predefined Row Aggregations
A number of predefined row aggregations are available, such as First
, Last
, Single
etc.,
see RowAggregationFunction for details.
Create these
with IGroupByCommand grouping,
with IEqualityComparer grouping,
or without grouping.
Here's an example that selects the last input row in each group:
// using actionETL;
// using System;
// using System.Collections.Generic;
private class Row
{
public int Category;
public int Metric;
}
private readonly Row[] _sampleData = new Row[]
{
new Row { Category = 1, Metric = 2 }
, new Row { Category = 1, Metric = 2 }
, new Row { Category = 1, Metric = 5 }
, new Row { Category = 2, Metric = 10 }
};
public WorkerSystem RunExample()
{
var workerSystem = new WorkerSystem()
.Root(ws =>
{
var source = new EnumerableSource<Row>(ws, "Source", _sampleData)
.Output.Link.AggregateTransform1(
"Transform"
, RowAggregationFunction.Last
, g => g.Name(nameof(Row.Category)) // Group by column [Category]
)
// Receives two rows:
// { Category = 1, Metric = 5 }
// { Category = 2, Metric = 10 }
.Output.Link.CollectionTarget();
});
workerSystem.Start().ThrowOnFailure();
return workerSystem;
}
See Also
- Dataflow
- Dataflow Rows
- Dataflow Columns
- Dataflow Aggregations