Adoption of Agile Methodology in Data Projects

Data Process

(Role Playing by: AshayT, SanketS, KanchanP, BimanS & NeerajS; Photos by: ManishS; Design by: NimeshD; Concept by: PramodR)

Before discussing agile methodology in data projects, let’s briefly look at the nature of data projects. Almost all data projects have three main steps, popularly known as ETL: Extraction (E), Transformation (T), and Loading (L). At Deerwalk, the three main steps of a data project are data import, data mapping, and application processing, and these correspond roughly to the three ETL steps.

First, the client’s raw data is imported into our import tables. The imported data is then cleaned, mapped, and moved to Deerwalk’s standard scrub tables, with the necessary business logic applied. For mapping, a Data Standardization Document (DSD) is prepared; it describes the required business logic and how each field of the source tables maps to each field of the scrub tables. For data projects, the DSD plays the role that the detailed design plays in software development projects. On the basis of the DSD, scripts are written to convert client-specific data into Deerwalk’s standard format. The data is then processed for presentation in our applications.
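As a rough illustration of how a DSD can drive a scrub script, the mapping can be thought of as a table of scrub field, source field, and business rule. The sketch below is not Deerwalk's actual implementation; the field names (`MEMBER_NO`, `mbr_id`, and so on) and the rules are hypothetical and chosen only to show the idea.

```python
from datetime import datetime


def to_iso_date(value):
    """Convert an MM/DD/YYYY string to ISO format (YYYY-MM-DD)."""
    return datetime.strptime(value.strip(), "%m/%d/%Y").date().isoformat()


# Hypothetical DSD entries: scrub field -> (source field, business rule).
DSD = {
    "mbr_id": ("MEMBER_NO", str.strip),
    "mbr_dob": ("BIRTH_DT", to_iso_date),
    "mbr_gender": ("SEX", lambda v: v.strip().upper()[:1]),
}


def scrub_row(import_row, dsd=DSD):
    """Apply every DSD rule to one imported row and return the scrubbed row."""
    return {
        scrub_field: rule(import_row[source_field])
        for scrub_field, (source_field, rule) in dsd.items()
    }


# One raw row as it might sit in an import table (made-up values).
row = {"MEMBER_NO": " 00123 ", "BIRTH_DT": "07/04/1980", "SEX": "female"}
scrubbed = scrub_row(row)
print(scrubbed)
```

Keeping the mapping in one structure means that a change agreed with the client during DSD Review touches a single entry rather than the scrub logic itself.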

The steps above are complemented by Data Import Review (imported data is compared with the control total), DSD Review (the DSD is reviewed to verify logic and mapping), and Scrub Review (unit testing performed by developers). In addition, Data Scrub QC is performed by independent QC resources to identify any defects in this phase. Some issues found in these steps require us to go back to the client for feedback and might also result in re-import and re-scrub of the data. Until all issues are resolved, one can expect many iterations in a data project.
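The Data Import Review can be pictured as a comparison between what was imported and the control total supplied with the files. The following is a minimal sketch under assumptions: it supposes the control total carries a row count and a sum of one amount column, and the names `import_review`, `paid_amount`, `row_count`, and `amount_sum` are hypothetical.

```python
from decimal import Decimal


def import_review(imported_rows, control_total, amount_field="paid_amount"):
    """Return a list of mismatches between imported data and the control total."""
    actual = {
        "row_count": len(imported_rows),
        # Decimal avoids floating-point drift when summing currency values.
        "amount_sum": sum(
            (Decimal(r[amount_field]) for r in imported_rows), Decimal("0")
        ),
    }
    return [
        f"{name}: expected {control_total[name]}, imported {value}"
        for name, value in actual.items()
        if value != control_total[name]
    ]


rows = [{"paid_amount": "120.50"}, {"paid_amount": "79.50"}]

# Control total agrees with the import: no issues reported.
issues_ok = import_review(rows, {"row_count": 2, "amount_sum": Decimal("200.00")})

# Control total expects one more row: the review flags the row count.
issues_bad = import_review(rows, {"row_count": 3, "amount_sum": Decimal("200.00")})
print(issues_ok, issues_bad)
```

An empty list means the import matches the control total; any entry in the list is a candidate for going back to the client or re-importing.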

Data projects require significant participation from clients throughout the project. This is the main reason for adopting agile methodology in data projects. In the inception phase, several iterations take place, with continuous communication between stakeholders inside and outside the company. This continuous communication also helps manage changes in requirements more effectively.

The agile software development methodology used in data projects can be explained through the following flow chart:

Data Process Flow Chart

The agile process in data projects can be divided into the following levels of planning:

a) Implementation Planning
To implement a client, the project manager needs information such as the number of data feeds / employer groups associated with the client, the data carriers, and the data files associated with the different feeds. This information is necessary to define the timeline and resources for the implementation.

b) Sprint Planning
The implementation of a client is broken down into sprints. Generally, the number of sprints depends on the number of data feeds. The duration of a sprint is fixed, and it usually takes 4 weeks to implement a particular data feed. Each sprint is divided into a number of iterative sub-phases.

c) Iteration Planning
The nature of data projects is such that the PM needs to plan for each iteration. The flow chart also shows that iteration occurs mainly in the Data Import, DSD Preparation, and Data Scrubbing phases. Iterations in the Data Scrubbing phase are mainly due to defects. To minimize re-scrubbing, DSD Review and Scrub Review should be done thoroughly.

d) Daily Planning
Every morning, the team meets for a quick stand-up meeting to discuss the previous day’s progress, the tasks to be worked on that day, and any impediments.

As in every other discipline of project management, data project managers focus on finishing the project within the specified time and making the best use of available resources. In data projects, the primary goal is cleaning and standardizing the client’s data. The data sent by clients depends on the data carriers. There is no hard and fast rule for standardizing data that works for all clients; we need to modify the logic according to each client’s data. The most common example is memberid. We need to make memberid unique for each member, but there is no fixed formula for doing so. Similarly, the level of aggregation differs from client to client.
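To make the memberid example concrete, here is a small sketch of why the logic has to be tuned per client: the same uniqueness check passes or fails depending on which source fields are combined into the key. The field names (`subscriber_no`, `dependent_seq`, `group_no`) and the helper functions are hypothetical, not Deerwalk's actual rules.

```python
from collections import Counter


def build_member_id(row, key_fields):
    """Join the chosen source fields into one candidate memberid."""
    return "-".join(str(row[f]).strip() for f in key_fields)


def duplicate_member_ids(rows, key_fields):
    """Return candidate memberids that occur more than once, sorted."""
    counts = Counter(build_member_id(r, key_fields) for r in rows)
    return sorted(mid for mid, n in counts.items() if n > 1)


# Made-up eligibility rows: a subscriber and a dependent share subscriber_no.
rows = [
    {"subscriber_no": "1001", "dependent_seq": "00", "group_no": "A"},
    {"subscriber_no": "1001", "dependent_seq": "01", "group_no": "A"},
    {"subscriber_no": "1002", "dependent_seq": "00", "group_no": "B"},
]

# For this client, subscriber_no alone collides across family members...
dups_one = duplicate_member_ids(rows, ["subscriber_no"])

# ...while adding dependent_seq makes every memberid unique.
dups_two = duplicate_member_ids(rows, ["subscriber_no", "dependent_seq"])
print(dups_one, dups_two)
```

For another client the right combination of fields could be entirely different, which is why this choice is normally settled with the client and recorded in the DSD.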

Continuous interaction with clients is therefore of utmost importance for the successful implementation of data projects. Non-iterative methodologies, like the waterfall model, are poorly suited to data projects, and hence we have adopted agile methodology in data projects at Deerwalk.

Some Feedback

It is really a very good article, Sanket. Thanks for the article. Well written.

However, I have a few concerns:

1. QC in data projects may redirect the flow back to import, or even back to the client if issues are identified in the layout. For instance, mistakes in imports may not be caught in the review, even with matching control totals; they may be identified later in QC and force a redo of the whole import process. QC, not only in data projects, may sometimes end in a redo of the whole life cycle. So I am a little concerned about the flow chart of the process.

2. Though you have been specific about data projects at Deerwalk in some sentences (particularly for Makalu), in places the article appears to generalize this process to almost all data projects. Correct me if I have misunderstood your theme. In many data projects, ETL is just the beginning. For instance, claims or encounters processing in health plans involves much more than Extract, Transform and Load. Who knows, maybe tomorrow Deerwalk will deal with such processing projects. So the title "Adoption of Agile Methodology in Data Projects" is general, but the body and flow chart limit Deerwalk projects to the "Makalu process". This article, posted on the company's website, represents the company's skill set in data projects. Limiting Deerwalk to the ETL boundary is like closing the doors to other projects where ETL is just a small part. I think the title should have been "Adoption of Agile Methodology in Makalu, a Deerwalk product".

There are many mountains besides Makalu.

Well wisher

Thanks for your feedback.

Thanks for your feedback. Here, I have only tried to draw an analogy between ETL and the process used at Deerwalk, because ETL is the more generic term. And yes, this article is mainly based on the process used at Deerwalk.

Nice approach

The way the process flow is shown in the flow diagram is impressive. I see that in the intermediate stages there are frequent reviews with analysis, and a loop back to the previous stage if the answer is 'NO'. I like this methodology. My small concern is whether these reviews involve end users. If yes, is it necessary to follow the complete agile process for this kind of data project? I presume that clients have no interest in the internal data representation of the DW. So I think a hybrid of agile and the typical waterfall model would have been a better solution.

funny

this is funny

The review is done by team

The review is done by team members within the team. During review, identified issues that can't be solved by people within the company are communicated to the client for feedback.