Enterprise ETL Framework vNext

Although its a little bit nuts, I’ve spent a fair amount of time lately daydreaming and brainstorming on what a distributed and highly scalable SSIS ETL framework capable of handling both on-premise and cloud based workloads would look like. Obviously, this framework is not for everyone but this post is about my initial thoughts and I would like to invite an open community dialog on the concept.

Tenets

When I started thinking about it, I set forth a few guiding principles:

  1. KISS – Keep it Simple Stupid. Okay so simple is relative. More specifically avoid complexity for the sake of complexity.
  2. Support both cloud and on-premise workloads with a single framework.
  3. Scale seamlessly with minimum set-up required.
  4. Convention over configuration. Reasonable defaults allow the framework to work without requiring complex configuration.
  5. Flexibility to run any SSIS package without changes.
  6. Visibility into framework through a standard set of REST APIs

Job Concept

The notion of a job within the framework represents a full process which contains one or more steps (i.e. packages). A job can be either full-defined in the request or can leverage a pre-defined job template whose metadata is kept in the job state store. Additionally, jobs can define dependencies or pre-requisites to other jobs allowing for complex orchestrations. Other aspects of the job include numerous preferences and configuration options such as performance, restart-ability and notifications.

Example Job Definition

{
    "jobName": "Test Job",
    "jobID": "{3F2504E0-4F89-41D3-9A0C-0305E82C3301}",
    "template": { "usage": "none" },    
    "tasks": {
        {
            "taskID": 1,
            "taskName": "Step 1",    
            "source": {
                "type": "Catalog"
                "project": "ETLLoader.ispac"
                "packageName": "DWLoader.dtsx"
            }
            "dependencies": {
                { }
            }
            "parameters": {
                {
                    "name": "SQLConnection",
                    "value": "...."
                }
            },
            "configuration": {
                "allowRetry": "false",
                "target": "OnPremise"
            }
        },
        {
            "taskID": 2,
            "taskName": "Step 2",    
            "source": {
                "type": "Catalog"
                "project": "ETLLoader.ispac"
                "packageName": "WebStatsLoader.dtsx"
            },
            "dependencies": {
                { "taskID": 1 }
            }
            "configuration": {
                "allowRetry": "true",
                "target": "Cloud"
            }
        }
    }
    "notifications": {
        "events": {
            { "eventType": "Completed" },
            { "eventType": "Error" }
        },
        "types": {
            {
                "type": "Email",
                { ... }
            }
        }
    }
}

Framework Concept

Below is a holistic view of what I’ve come up with. A brief discussion of the core piece follows.

ModernETL

  • Queues/Messaging – The various queues and/or messaging used through-out the framework would be implemented as Azure Service Bus topic/subscriptions and queues. Leveraging Service Bus allows for scalable and easy integration of the various parts/pieces of the framework.
  • Job Submission Agent – The submission agent monitors the incoming job queue, verifies that the request can be satisfied and generates the job in the job state store.
  • Job State Store – The job state store is an SQL Azure database. This database will be used to hold all job configuration and metadata including performance and runtime information. The store would also be set-up to hold job templates to allow for a single point of configuration for scheduled jobs.
  • Job Scheduling Agent – The scheduling agent tracks the state store for incoming job requests. It parses each request and handles the orchestration, workflow process and any business logic that must occur as part of the job. The scheduling agent does not handle reoccurring jobs or monitor jobs for health or status but does keep track of queue depth against a defined high-water mark.
  • Supervisor Agent – The supervisor agent is responsible for job health monitoring. The supervisor tracks job timeouts and failures and is responsible for implementing job retry logic if configured. The supervisor also keeps track of the other agents to ensure they are running as configured.
  • Heartbeat Agent – The heartbeat agent is an optional piece that is designed for organizations that use long running packages. To leverage this piece the SSIS packages must be instrumented to generate a heartbeat message. The agent monitors the heartbeat and updates the state store.
  • SSIS (On-Premise & Cloud) Agents – see below
  • REST API Interface – The REST API is a read-only interface that allows external applications to integrate with the framework for job history/data.
  • BLOB Storage – The framework would leverage a centralize package and assembly storage. All *.dtsx or *.ispac files and dependent assemblies would be stored in blob storage. Each SSIS Agent is capable of grabbing and installing the required resources.

SSIS Agents Concept

ModernETL2

Challenges/Open Questions

1. Security/Authentication – What is the best approach for securing the framework?

2. Logging – What is the best way to aggregate the SSIS logs from agents? Initial thoughts are to push to blob storage and leverage HDInsight to process/aggregate agent logs into meaningful information.

3. Notifications – Notifications should be pushed to a Service Bus Topic, should a Notification agent be used to publish notifications? How would notification be configured?

Next Steps

There are still a number of design issues to hammer out before any real work takes place. I would also welcome and thoughts, criticism and feedback. I am planning on developing this as an open-source project that will ultimately be available on codeplex.com. If you would like to contribute in anyway, please feel free to contact me directly at cprice at microsoft dot com.

Till next time!

Chris

Advertisements

2 thoughts on “Enterprise ETL Framework vNext

  1. Pingback: Building Custom SSIS Azure Service Bus Components | Bluewater SQL

  2. Pingback: Building Custom SSIS Azure Service Bus Components - Bluewater SQL - Site Home - MSDN Blogs

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s