Dataflow Column Mapping
This article describes how to map and copy columns and rows when using existing workers. Examples include specifying which dataflow columns to copy to an external data source in a target worker, or which upstream columns to copy to which downstream columns in a transform worker.
Note
For implementing this type of functionality in your own workers, and for further details, please see Custom Dataflow Column Mapping.
Mapping Examples
Workers like AdbInsertTarget<TInputError> and the join transforms use the mapping facilities to map or copy columns and rows. Column copies are done with high-performance generated code, automatically using the appropriate technique for each data type.
For most such workers, the library user specifies what the worker should do by providing an action that executes commands from IColumnMapperCommand or IRowMapperCommand, e.g. mapping or copying from column "Dept" to column "Department":
// ...
source.Output.Link.AdbInsertTarget(
"Insert Employees"
, map => map.Name("Dept", "Department")
, acs.CreateConnectionBuilder()
, targetTableName
);
Data sources without any intrinsic column names will get the column ordinal position as
column name, i.e. "0", "1" etc. Each data source controls how ordinal positions
are assigned - e.g. for CSV files it would be in column order. This allows using mapping
by name also for these data sources. Specifying multiple mappings is done by
chaining method calls:
map => map.Name("5", "DepartmentId").Name("1", "Department")
Some workers also use row copiers to copy whole rows, which is mutually exclusive with copying individual columns:
map => map.Row()
Furthermore, row copiers can map or copy multiple columns automatically based on
column names being exactly or partly the same, and (optionally) assert number of
matches found (5 in this case):
map => map.AutoName(5)
One can also restrict the auto-mapping to specific column schemas to avoid name
collisions, e.g. only map from descendants of "TransactionInfo", to descendants
of "Info.Transaction", again (optionally) asserting number of
matches found (3 in this case):
map => map.AutoName(3, "TransactionInfo", "Info.Transaction")
Column Name Mapping Rules
When column names are mapped (explicitly or automatically), the following rules apply:
- Column name matching is ordinal case insensitive, but a case
sensitive match takes precedence over a case insensitive match
- Note that some data sources change the case of identifiers, some even if quoted (e.g. PostgreSQL)
- Mapping the row as a single unit is mutually exclusive with commands mapping individual columns
- For most workers and unless otherwise stated, a from column can be mapped multiple times to different to columns, but a to column can only have a single from column mapped to it
- Only their (dot separated) suffixes must match, i.e. one can leave out name parts from
the left;
"X.A"could match with e.g."A","X.A","Y.A","Z.Y.A" - If there are multiple matches, more matching parts take precedence over fewer matching
parts (irrespective of casing);
"X.A"would match to"X.a"ahead of"A" - Explicit mapping can refer to column schemas (i.e. multiple columns); if both sides
have columns
"X.A"and"X.B", mapping"X"maps it as a whole unit rather than as individual columns - For auto-mapping only:
- If multiple matches remain, a runtime error is raised
Note
To resolve duplicate matches:
- Specify additional or all name parts in explicit mapping
- Change the column (or schema) names in the from and/or to rows to avoid name clashes
- Restrict auto-mapping to specific column schemas
- Map offending columns explicitly by name (which will exclude them from later auto-mappings)
Auto-mapping Names
Auto-mapping can significantly simplify column mapping when column names have been aligned, especially with larger number of columns. It is good practice to give columns their final name as far upstream as possible, and to assert the number of auto-mapped columns (to guard against issues when changing columns).
Some workers default to auto-mapping columns by name, where this is a common and
generally safe approach. For others, the library user must explicitly specify the
mapping command, e.g. AutoName().
Auto-mapping examples:
| From Column Names | To Column Names | Mapping Outcome | Comment |
|---|---|---|---|
| "FirstName", "Name" | "Name" | "Name" -> "Name" | Only maps dot-separated name parts |
| "Person.Name.FirstName" | "firstname" | "Person.Name.FirstName" -> "firstname" | Right-most part is (case insensitive) identical |
| "Person.Name.FirstName", "FirstName" | "FirstName" | Duplicate Mapping Error | Two matches with same number (1) of matching parts |
| "Person.Name.FirstName", "firstname" | "Name.firstname" | "Person.Name.FirstName" -> "Name.firstname" | Two matching parts takes precedence over exact case and one matching part |
| "Person.Name.FirstName", "FirstName" | "Name.FirstName", "FirstName" | "Person.Name.FirstName" -> "Name.FirstName", "FirstName" -> "FirstName" | Two matching parts takes precedence over one matching part |
Note
Another good alternative for simplifying column mapping is to group columns into one or more Column Schemas, and explicitly map the schema(s), which would copy all columns in the schema(s).
Map Column to Name
As one example, the AdbInsertTarget<TInputError> worker uses FromTypeColumnMapper<TFrom> to map from dataflow columns to external database columns by name. It allows specifying which upstream dataflow columns should be mapped to which external downstream columns, when inserting rows into the database, here done in three different variations:
public class Person
{
public int Age { get; set; }
public string FirstName { get; set; }
public string LastName { get; set; }
}
/* SQL table:
CREATE TABLE Person (
Years int NULL,
FirstName nvarchar(50) NULL,
LastName nvarchar(50) NULL
)
*/
// ...
// No mapping specified => auto-map names => map "FirstName" and "LastName":
source0.Output.Link.AdbInsertTarget(
"Insert Employees 0"
, acs.CreateConnectionBuilder()
, targetTableName
);
// Map from "Age" to "Years", and auto-map (assert exactly 2) other names,
// i.e. "FirstName" and "LastName":
source1.Output.Link.AdbInsertTarget(
"Insert Employees 1"
, map => map.Name("Age", "Years").AutoName(2)
, acs.CreateConnectionBuilder()
, targetTableName
);
// Specify all columns explicitly. Only one parameter is needed when
// (the last part of) the name matches:
source2.Output.Link.AdbInsertTarget(
"Insert Employees 2"
, map => map.Name("Age", "Years").Name("FirstName").Name("LastName")
, acs.CreateConnectionBuilder()
, targetTableName
);
Map Name To Column
ToTypeColumnMapper<TTo> is similar to the above FromTypeColumnMapper,
except it instead maps from external source names to dataflow columns. This is often used
by dataflow source workers, e.g. AdbDataReaderSource<TOutput>.
Map Column To Column
Transforms like InnerJoinMergeSortedTransform<TLeftInput, TRightInput, TOutput> map columns between two inputs (for specifying the join), and copy columns from upstream rows to downstream rows.
Rows Without struct Column Schemas
In this example, the Person row type has DepartmentName and DepartmentCostCenter
(empty) columns already present, and the inner join populates these columns from a
Department row source:
public class Person
{
public int PersonId;
public string PersonName;
public int DepartmentId;
public string DepartmentName;
public int DepartmentCostCenter;
}
public class Department
{
public int DepartmentId;
public string DepartmentName;
public int DepartmentCostCenter;
}
// ...
var transform = personSource.Output.Link
.InnerJoinMergeSortedTransform<Person, Department, Person>(
"Inner Join Person and Department"
, departmentSource.Output
, compare => compare.Asc("DepartmentId") // Join on DepartmentId
// Pass Person input rows to output:
, mapPerson => mapPerson.Row()
// Copy DepartmentName & DepartmentCostCenter from Department input:
, mapDept => mapDept.Name("DepartmentName").Name("DepartmentCostCenter")
);
// ...
Alternatively, by specifying individual columns (instead of Row()) for both the left
and right row copiers, the same data would be output, but instead of passing on
Person input rows, new output rows will be automatically created (by calling the
Person parameterless constructor), before columns are copied from both input rows to
output row:
// Copy PersonId, PersonName & DepartmentId input columns to output:
, mapPerson => mapPerson.Name("PersonId").Name("PersonName").Name("DepartmentId")
// Copy DepartmentName & DepartmentCostCenter from Department to output:
, mapDept => mapDept.Name("DepartmentName").Name("DepartmentCostCenter")
Note
Columns can also be copied to a column with a different name, as well as auto-mapped by name, see Map Column To Name.
Rows With struct Column Schemas
In the previous example, logically the join adds a DepartmentName and a
DepartmentCostCenter column to the Person schema. Any time a worker logically
adds or removes columns, one should consider implementing this explicitly, either by
using Column Schemas, or (more rarely)
with Row Type Inheritance.
This example uses struct column schemas:
public struct PersonSchema // Reusable column schema
{
public int PersonId;
public string PersonName;
}
public class Person // Left input
{
public PersonSchema PersonSchema;
public int DepartmentId;
}
public struct DepartmentSchema // Reusable column schema
{
public int DepartmentId;
public string DepartmentName;
public int DepartmentCostCenter;
}
public class Department // Right input
{
public DepartmentSchema DepartmentSchema;
}
public class PersonDepartment // Output, reusing schemas
{
public PersonSchema PersonSchema;
public DepartmentSchema DepartmentSchema;
}
// ...
var transform = personSource.Output.Link
.InnerJoinMergeSortedTransform<Person, Department, PersonDepartment>(
"Inner Join Person and Department"
, departmentSource.Output
// Join Person.DepartmentId with Department.DepartmentSchema.DepartmentId:
, compare => compare.Asc("DepartmentId")
// Copy whole Person struct to output:
, mapPerson => mapPerson.Name(nameof(Person.PersonSchema))
// Copy whole Department struct to output:
, mapDept => mapDept.Name(nameof(Department.DepartmentSchema))
);
// ...
Note
- One column schema can be reused in many different row types
- Column schemas allow manipulating multiple columns within a row as a group
- The more columns a schema has, the bigger the benefit becomes from reusing it - mainly in terms of smaller code size and easier maintenance
- Using
structcolumn schemas makes it simple to avoid having unused columns in a row (as is the case with e.g.Person.DepartmentCostCenterin the previous example), reducing the risk for bugs
Duplicating Columns
Some column types require a deep copy
when being duplicated (see
Row Ownership rules), which is
available for data types with multi-copy support.
This is the case in the next example, where a byte[] is copied to two output columns.
Although the library user usually doesn't need to be aware of this (workers can use TypeColumnCopier<TFrom, TTo> to automate this), under the hood:
- For best performance, the first copy (to
Widget.Datasince it is listed first) will be a shallow copy - the inputbyte[]instance reference will be copied to the output column - The second and any subsequent copies of the same column for this worker,
will be a deep copy - a new
byte[]instance will be allocated, and the contents of the input array will be copied to the newWidget.OriginalDataarray instance
public class Widget
{
public int WidgetId;
public byte[] Data;
public byte[] OriginalData;
}
public class Custom
{
public int WidgetId;
public byte[] Data;
}
// ...
var transform = widgetSource.Output.Link
.InnerJoinMergeSortedTransform<Widget, Custom, Widget>(
"Inner Join Widget and Custom"
, customSource.Output
, compare => compare.Asc("WidgetId") // Join on WidgetId
// Pass Widget input rows to output:
, mapWidget => mapWidget.Row()
// Custom.Data will be shallow copied to Widget.Data,
// and deep copied to Widget.OriginalData:
, mapCustom => mapCustom.Name("Data").Name("Data", "OriginalData")
);
// ...
Prohibited Duplication
A type can also support a single shallow copy per worker (e.g. for passing it from
upstream to downstream), but prohibit any further copies
(TypeSchemaCopyOption = SingleShallow).
Specifying it to be duplicated multiple times would generate a runtime error.
Again, Dataflow Columns describes which types support multi-copy.
Custom Code
Some dataflow workers (e.g. InnerJoinMergeSortedTransform<TLeftInput, TRightInput, TOutput>) allow the library user to either use the approach detailed above to configure how to copy columns, or to use custom code to do the copying explicitly.
Other workers (e.g. RowActionTransform<TInputError, TOutput>) only provide the option to use custom code.
See Also
- Dataflow
- Dataflow Rows
- Dataflow Columns
- Compare Dataflow Columns
- Mapping commands: IRowMapperCommand, IColumnMapperCommand
- Row mappers and mapping result:
- Row copier: TypeColumnCopier<TFrom, TTo>