
    Dataflow Columns

    Dataflow Rows contain columns, which are public fields and properties that hold the individual values in the row (see IsColumn).

    Note
    • Only public instance fields and properties are columns. Avoid using non-public or static fields or properties in rows, since the dataflow infrastructure will ignore them.
    • Both regular properties and auto properties are supported.
    • Properties can be read-write, read-only, or write-only (see CanRead and CanWrite).

    The row class type acts as a schema for the row, i.e. a specification of what columns the row holds, and what the individual column data types are.
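
    For example, in the following minimal sketch (OrderRow and its members are illustrative names, not part of the library), only the two public instance members become columns:

    public class OrderRow
    {
        public int OrderId;                    // column: public instance field
        public string Customer { get; set; }   // column: public auto property

        internal string Comment;               // ignored: non-public
        public static int InstanceCount;       // ignored: static
    }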

    Column Data Types

    A dataflow column (i.e. data sent via ports between dataflow workers) can have virtually any data type. The only restriction is that there are no circular references, e.g. a dataflow row type MyRow cannot contain a column with the MyRow type.
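
    For instance, the following illustrative row type is invalid, since the MyRow column would create a circular reference:

    public class MyRow
    {
        public int Id;         // fine
        public MyRow Parent;   // not allowed: circular reference to the row type itself
    }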

    Note

    A different aspect is how dataflow column data types translate to external data types. For SQL databases, see Database Data Types and Adb Supported Databases.

    Multi-copy Support

    Each dataflow column type has a TypeSchemaCopyOption, which describes whether and how the type can be automatically copied (i.e. cloned). This multi-copy support is available for Shallow and SingleShallowThenDeep types, and is required by the following dataflow workers (if the column in question is included in the output):

    • All *join* transforms (InnerJoinMergeSortedTransformFactory etc.)
    • MulticastTransformFactory
    • RepeatRowsSource<TOutput>

    Column types without multi-copy support (SingleShallow types, which is also the default for unknown types) can, on the other hand, only flow or be automatically copied to a single downstream column. The same restriction therefore applies to whole rows that contain one or more columns without multi-copy support; these rows can flow to one of any number of downstream workers, but each row cannot be automatically duplicated.

    Note
    • Do use only multi-copy column types where possible, since that simplifies using the above workers
    • When using column types without multi-copy support, you can still use the above workers by providing the copying logic yourself
    • Dataflow Column Mapping shows how automatic column copying is specified

    Multi-copy Types

    The following types have multi-copy support; language-specific type names (e.g. C# int, Visual Basic Integer etc.) work equally well (a combined sketch follows the list):

    • Shallow copied:
      • Boolean, Byte, Char, DateTime, DateTimeOffset, Decimal, Double, Guid, Int16, Int32, Int64, PhysicalAddress, SByte, Single, String, TimeSpan, UInt16, UInt32, UInt64
      • Enumerations, both built-in and user (enum MyEnum { ... } etc.)
      • Many database provider-specific data types (e.g. System.Data.SqlTypes.SqlDateTime), see the Npgsql and SqlClient Adb providers.
      • For the value types above, this includes the corresponding Nullable<T> types, e.g. Nullable<int>, Nullable<MyEnum> (or int?, MyEnum? in C#).
    • SingleShallowThenDeep copied:
      • Arrays of all Shallow copied types referenced above, e.g. Byte[], DateTime[,], MyEnum[,,] etc.
      • Memory<Byte>, Memory<Byte>?
      • Some database provider-specific data types (e.g. System.Data.SqlTypes.SqlBytes), see the Npgsql and SqlClient Adb providers.
      • RowErrorCollection - Note that to simplify exception reporting, any included exceptions are treated as immutable even though they are technically mutable. Therefore, never modify or rethrow an exception stored in a RowErrorCollection (or CaptureRowErrors) after it has passed through a worker that might duplicate it.
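
    As a combined, illustrative sketch (MixedRow and its members are example names), the row below uses several multi-copy column types, plus one column whose type is, assuming it is not among the types listed above, an unknown type that defaults to SingleShallow; this whole row can therefore not be automatically duplicated:

    using System;
    using System.Collections.Generic;

    public enum MyEnum { A, B }

    public class MixedRow
    {
        public int Quantity;              // Shallow
        public MyEnum? Status;            // Shallow, i.e. Nullable<MyEnum>
        public DateTime[] Timestamps;     // SingleShallowThenDeep: array of a Shallow type
        public List<string> Tags;         // assumed unknown type: defaults to SingleShallow
    }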

    Unknown Column Names and Types

    In some cases the number, names and types of source data columns might not be known upfront, e.g. when reading a CSV file with initial known columns followed by an arbitrary number of unknown columns. It is almost always best to store this data in static column types (see the sketch after this list), e.g.:

    • Retain the unknown source columns in a single string dataflow column, and only parse out the data when needed
    • Parse the unknown source columns into either of:
      • JSON, stored in a single string dataflow column
      • Single Dictionary<string,string> dataflow column with name-value pairs
      • One string[] with column names stored outside the dataflow (since it doesn't change until reading the data source again), and one string[] dataflow column with the values
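
    As a minimal sketch of the dictionary option above (CsvRow and its members are illustrative names):

    using System.Collections.Generic;

    public class CsvRow
    {
        // Known, static columns
        public int Id;
        public string Name;

        // Unknown trailing source columns, stored as name-value pairs
        public Dictionary<string, string> ExtraColumns;
    }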

    Dynamic Objects

    The dataflow also supports using dynamic (or Object with late binding in Visual Basic) both as a row type and as a column type. This approach can be useful in rare cases but comes with several disadvantages:

    • The dynamic type becomes a black box and you have to do more work yourself:
      • Create the instances (e.g. ExpandoObject) and set their members (see the sketch after the note below)
      • Access to dynamic members (i.e. 'pseudo columns') must be coded explicitly, e.g. when sorting or joining on a column (the mapping and compare facilities are unavailable), when duplicating dynamic instances (dynamic becomes SingleShallow), and when writing to any destinations
    • Member access can't be type checked at compile time and any type errors will throw at runtime
    • Accessing dynamic type members is much slower than accessing static types

    Note

    One could envision using dynamic as a way to avoid creating a row class when dealing with a very large number of columns and column types, but this would again have the disadvantages listed above. Instead, for this scenario consider whether automating the generation of the class code is possible and preferable.
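
    As a small illustrative sketch of the explicit work involved (FirstName is a hypothetical pseudo column):

    using System.Dynamic;

    // ...

    dynamic row = new ExpandoObject();
    row.FirstName = "Kevin";          // no compile-time checking; a typo only fails at runtime

    string firstName = row.FirstName; // reading also binds (and can fail) at runtime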

    Type Workarounds

    If the extensive type and conversion support in the dataflow and data sources is not sufficient, there are several possible workarounds.

    Change to Supported Type

    The simplest approach is to change the dataflow and/or data source types to ones that are supported, while ensuring the new types have enough precision etc. to still store all required values.
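
    For example, assuming a System.Drawing.Color column is not supported by the data source in question, it could be replaced with its Int32 ARGB value (an illustrative sketch; PixelRow and ColorArgb are example names):

    public class PixelRow
    {
        // Supported Int32 column instead of the (assumed unsupported) Color column;
        // set via color.ToArgb() and read back via Color.FromArgb(ColorArgb)
        public int ColorArgb;
    }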

    Use an Intermediate Format

    When reading the source data, convert it to a supported type, e.g. a plain, JSON, or XML string; transport and/or process this intermediate format; and, if needed, convert back to the desired type when outputting to a target.
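
    As a minimal sketch of the JSON variant (TransportRow, PayloadJson and SourcePayload are illustrative names), using System.Text.Json:

    using System.Text.Json;

    public class TransportRow
    {
        // Unsupported source value, carried through the dataflow as a supported string column
        public string PayloadJson;
    }

    // When reading the source:
    //     row.PayloadJson = JsonSerializer.Serialize(sourcePayload);
    // When writing to the target, convert back if needed:
    //     var payload = JsonSerializer.Deserialize<SourcePayload>(row.PayloadJson);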

    Add a Conversion Property

    Add a conversion property to the row class so that the column in question can be read and written as a different type where needed.

    For instance, all Adb providers can read and write enum values from and to integer database columns, except the Adb Npgsql Provider, which can only read them, not write them to integer columns (due to having its own enum processing logic).

    This example shows how to use a conversion property to still store the enum as an integer (see the Ranking property and column), while retaining the ability to use it as an enum in the dataflow (see the RankingAsEnum property).

    using actionETL;
    using actionETL.Adb;
    using actionETL.Adb.NpgsqlExternal;
    using System.Threading.Tasks;
    
    public static partial class AdbNpgsqlEnumCast
    {
        public enum Ranking { Low, Mid, High }
    
        private sealed class Category
        {
            public string CategoryName { get; set; }
    
            // Conversion property: exposes the enum value as an int, matching the
            // integer 'Ranking' database columns; the dataflow uses RankingAsEnum
            public int Ranking { get => (int)RankingAsEnum; set => RankingAsEnum = (Ranking)value; }
            public Ranking RankingAsEnum { get; set; }
        }
    
        private const string _sourceTableName = "CategorySource";
        private const string _targetTableName = "Category";
    
        private static async Task<SystemOutcomeStatus> RunExampleAsync()
        {
            return await new WorkerSystem()
                .Root(root =>
                {
                    var acs = AdbNpgsqlProvider.Get()
                        .CreateConnectionString(root.Config["NpgsqlTester"]);
    
                    _ = new AdbDataReaderSource<Category>(root, "Extract Table"
                        , acs.CreateConnectionBuilder()
                        , "SELECT CategoryName, Ranking FROM " + _sourceTableName
                        , cmca => cmca.AutoName())
    
                    .Output.Link.RowActionTransform1("Transform", row =>
                    {
                        if (row.RankingAsEnum == Ranking.Mid)
                            row.RankingAsEnum = Ranking.High;
                    })
    
                    .Output.Link.AdbInsertTarget("Insert Table"
                        , acs.CreateConnectionBuilder(), _targetTableName);
                })
                .StartAsync().ConfigureAwait(false);
        }
    }
    
    /* Expects these SQL tables:
            CREATE TABLE CategorySource
            (
                CategoryName varchar(50),
                Ranking int   -- Note the 'int' data type
            )
            CREATE TABLE Category
            (
                CategoryName varchar(50),
                Ranking int   -- Note the 'int' data type
            )
    */
    

    Column Names

    This row example has all columns at the same level (i.e. without any struct column schema grouping, see below); the column names must therefore be unique within the row:

    public class Person
    {
        public int Age;
        public string FirstName;
        public string MiddleName;
        public string LastName;
        public string FullName;
    }
    
    // ...
    
    var person = new Person();
    person.FirstName = "Kevin";
    
    // ...
    
    var transform = source.Output.Link.SortTransform(
        "Sort", cb => cb.Asc("FirstName"));
    

    Where possible, do use the same names for both the columns in the data row and for external columns (such as database and spreadsheet columns), since many workers can automatically map columns if they have the same name.

    Column Name Casing

    Different data sources (and .NET types) have different column identifier case policies, e.g. case sensitive, case insensitive, fold to lower case, or fold to upper case. To deal with this when specifying and matching column names, actionETL uses the following approach where possible:

    Important

    Column name matching is ordinal case insensitive, but a case sensitive match takes precedence over a case insensitive match.

    This allows matching a column name "MyColumn" even if there is another column name that only differs in case, e.g. "mycolumn". It also simplifies matching columns between systems with different casing policies.
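
    For example (CasingRow is an illustrative name):

    public class CasingRow
    {
        public int MyColumn;
        public int mycolumn;
    }

    // "MyColumn" selects the MyColumn column: the case sensitive match takes precedence
    // "mycolumn" selects the mycolumn column for the same reason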

    Column Schemas

    A public row field can also be a struct that groups columns, creating reusable partial schemas (see IsSchema) that can be combined with columns and other partial schemas. These column schemas can also be nested, i.e. a struct within a struct, up to 10 levels deep.

    Note
    • The column schema itself must be:
      • Stored in a public field (properties are only allowed for columns, i.e. at the lowest level)
      • A struct and not Nullable<T> and not a multi-copy or Always Column type
    • Multi-copy data types are always columns, even when stored in a field. This is also true for:
      • BitArray, Dictionary<String,String>, IDictionary<String,String>, IPAddress, ValueTuple<Net.IPAddress,Int32>, ValueTuple<Net.IPAddress,Int32>?
      • Data provider-specific data types specified as Always Column
    • struct as a reusable schema is generally more flexible than using inheritance. In certain scenarios, however, the latter can require slightly less code.

    This example uses a hierarchy of column schemas:

    public struct PersonNameParts
    {
        public string FirstName;
        public string MiddleName;
        public string LastName;
    }
    
    public struct PersonName
    {
        public PersonNameParts NameParts;
        public string FullName;
    }
    
    public class Person
    {
        public int Age;
        public PersonName Name;
    }
    
    // ...
    
    var person = new Person();
    person.Name.NameParts.FirstName = "Kevin";
    
    var person2 = new Person();
    person2.Name = person.Name;
    
    // ...
    
    var transform = source.Output.Link.SortTransform(
        "Sort", cb => cb.Asc("FirstName"));
    

    Also see the Slowly Changing Dimension Example, which uses a column schema to store and look up values using a composite business key.

    Key Points

    In a dataflow row:

    • A member in a dataflow row or a column schema is a column if either it's a public property, or it's a public field that is not a column schema.
    • A public field is a column schema if it is a struct and it's not Nullable<T> and it's not a multi-copy or Always Column type.
    • Columns and column schemas are accessed via the normal member dot '.' notation.
    • By using a struct column schema, multiple columns can be shallow copied as a group, as in the statement person2.Name = person.Name; in the example above. This reduces code size and improves performance (compared to copying columns individually).
      • Note that some column types require deep copying; do use the built-in row copier to let the system pick the correct approach
    • Some workers (like SortTransform) allow specifying columns (here an ascending sort column) as a string. The string can also use dot notation, e.g. cb => cb.Asc("Name.NameParts.FirstName"), but if a trailing part of the name is unique within the row, it is sufficient to specify only that part (i.e. "NameParts.FirstName" or "FirstName").
    • Column name matching is ordinal case insensitive, but a case sensitive match takes precedence over a case insensitive match.
    • Column schemas like PersonNameParts and PersonName above can be reused in other row types, e.g. in a Contact class, which would simplify joining the two row types, copying between them, and maintaining them.

    See Also

    • Dataflow
      • Dataflow Rows
      • Dataflow Blocking and Row Buffering
      • Dataflow Column Mapping
      • Compare Dataflow Columns
      • Dataflow Row Errors
      • Dataflow Lookups