Compare Dataflow Columns
A number of workers need to compare rows, either to sort them, or to test them for equality. Generally, the comparison can be specified in one of two ways:
- By column - Specify the column(s) to be included in the comparison
- The worker generates high performance code to perform the comparison for each row
- Much simpler than by row when used with multiple columns
- Multiple columns can be specified, but cross column calculations cannot be done. Either define a property on the row type with the desired comparison value and specify that property column, or use by row comparison.
- By row - Specify the comparison for the whole row
- Set a default comparison by implementing IComparable<T> or IComparable on the row type, and/or
- Pass an IComparer<T>, Comparison<T>, or Comparison<TLeft, TRight> to the worker, which overrides any default comparison
- Provides complete flexibility in defining the comparison
- Requires more code than by column, especially when checking multiple columns
Note
- If there is no explicit or default comparison available when row comparison is attempted, a runtime error will occur.
- This article describes how to specify the comparison when using existing workers. For implementing this type of functionality in your own custom workers, and for further details, please see Custom Dataflow Column Mapping.
Compare By Column Examples
Workers like SortTransform<TInputOutput> compare one or more columns in one row with the same columns in a different row of the same type. The library user can specify which columns should be included in the comparison by providing an action that executes commands from IRowComparerCommand<T>.
For instance, to sort the rows first on column "Budget" in descending order, then on column "Department" in ascending order:
public class Budgets
{
public double Budget { get; set; }
public string Department { get; set; }
}
// ...
var transform = budgetSource.Output.Link.SortTransform(
"Sort Budgets"
, compare => compare.Desc("Budget").Asc("Department")
);
// ...
Workers like InnerJoinMergeSortedTransform<TLeftInput, TRightInput, TOutput> compare one or more columns in one row with a different row, potentially of a different type with different columns. The library user can specify which columns should be included in the comparison by providing an action that executes commands from IRowComparerCommand<TLeft, TRight>.
For instance, to compare column "Dept" in one input with column "Organization.Department" (i.e. using a column schema) in a second input, in ascending order:
compare => compare.Asc("Dept", "Organization.Department")
For data sources without any intrinsic column names, column ordinal position can be used. Furthermore, with multiple columns, the ordering is from the highest to the lowest priority, so to sort first on "Budget", then on "Name":
compare => compare.Desc("5", "Budget").Asc("1", "Name")
Other options include specifying a StringComparer (which has several predefined string comparers), or a Comparison<T> (which provides complete freedom in defining how a particular column is compared):
compare => compare
.Asc("Department", StringComparer.CurrentCultureIgnoreCase)
.Asc("Name", (string l, string r) => l.Length.CompareTo(r.Length))
Note
- The default comparison for string column data is case sensitive and uses the current culture
- Do specify explicitly how string columns are compared (e.g. by supplying a
StringComparer
), to minimize issues if e.g. the host current culture changes - Please see Best Practices for Using Strings in .NET for details on string comparisons, current culture etc.
Column Name Mapping Rules
When column names are mapped, the following rules apply:
- Column name matching is ordinal case insensitive, but a case
sensitive match takes precedence over a case insensitive match
- Note that some data sources change the case of identifiers, some even if quoted (e.g. PostgreSQL)
- Only their suffixes must match, i.e. one can leave out name parts from the left;
"X.A"
could match with e.g."A"
,"X.A"
,"Y.A"
,"Z.Y.A"
- If there are multiple matches, more matching parts take precedence over fewer matching
parts (irrespective of casing);
"X.A"
would match to"X.a"
ahead of"A"
Note
- To resolve duplicate matches:
- Specify additional or all name parts in explicit mapping
- Change the column (or schema) names to avoid name clashes
- Unlike with Dataflow Column Mapping, no automatic match of column names is done when comparing columns.
Compare By Row Examples
Here we set a default comparison for a row type by implementing IComparable<T> on the row type, and later override it for particular workers by passing either an IComparer<T> or a Comparison<T> to the worker:
using actionETL;
using System;
using System.Collections.Generic;
public class Budgets : IComparable<Budgets>
{
public double Budget { get; set; }
public double Sales { get; set; }
// Sort on ascending Budget
public int CompareTo(Budgets other) =>
other is null || this.Budget > other.Budget ? 1 : this.Budget < other.Budget ? -1 : 0;
// Note: Although not used for the row comparison, best practice is to also
// implement Equals(), GetHashCode(), and operators to match with CompareTo(),
// since otherwise they will give a different result vs. CompareTo().
}
// ...
var transform0 = budgetSource0.Output.Link.SortTransform(
"Sort using default CompareTo() comparison"
);
var transform1 = budgetSource1.Output.Link.SortTransform(
"Sort on (Sales - Budget) by using a supplied comparer"
, Comparer<Budgets>.Create((Budgets x, Budgets y) =>
(x.Sales - x.Budget).CompareTo(y.Sales - y.Budget))
);
// Here the comparison is done inline, which can be slightly faster vs.
// calling CompareTo().
var transform2 = budgetSource2.Output.Link.SortTransform(
"Sort on (Sales - Budget) by using a supplied comparison delegate"
, (Budgets x, Budgets y) =>
x.Budget < y.Budget ? -1 : x.Budget > y.Budget ? 1 : 0
);
// ...
You can also use RowComparer<T> to generate the implementation
of CompareTo()
:
using actionETL;
using System;
public partial class Person : IComparable<Person>
{
public int Age { get; set; }
public string FirstName { get; set; }
public string LastName { get; set; }
private static readonly Comparison<Person> _comparison =
new RowComparer<Person>(
cb => cb.Asc(nameof(LastName)).Asc(nameof(FirstName))
).Comparison;
public int CompareTo(Person other) => _comparison(this, other);
}
Override Other Comparison Members
Although the out of box workers only use the CompareTo()
method for the row comparison,
it is best practice to also implement Equals()
, GetHashCode()
, and operators to match
with CompareTo(), since otherwise they will give a different result vs. CompareTo()
.
Note
It is highly recommended to use an analyzer to both indicate these members should be implemented, and to generate the implementation automatically. Either use the Microsoft Code Analysis 2017 or 2019 Visual Studio extensions, or install the Microsoft.CodeAnalysis.FxCopAnalyzers NuGet package in your project.
In the below example these members have been overridden on the Person
class.
public partial class Person : IComparable<Person>
{
public override bool Equals(object obj) =>
obj is Person p && _comparison(this, p) == 0;
public override int GetHashCode()
{
unchecked // allow "wrap around" calculating the returned int
{
return LastName.GetHashCode() + 31 * FirstName.GetHashCode();
}
}
public static bool operator ==(Person left, Person right)
{
if (left is null)
return right is null;
return left.Equals(right);
}
public static bool operator !=(Person left, Person right) =>
!(left == right);
public static bool operator <(Person left, Person right)
{
if (left is null)
return !(right is null);
return left.CompareTo(right) < 0;
}
public static bool operator <=(Person left, Person right) =>
left is null || left.CompareTo(right) <= 0;
public static bool operator >(Person left, Person right)
{
if (left is null)
return false;
return left.CompareTo(right) > 0;
}
public static bool operator >=(Person left, Person right) =>
left is null ? right is null : left.CompareTo(right) >= 0;
}