Big Data Lifecycle Management
A simplified overview for beginners
In this article, I want to provide an architectural overview of the Big Data lifecycle management based on my experience in the field. Understanding this process is essential to architect and design Big Data solutions.
Big data is different from traditional data. The main differences come from characteristics such as volume, velocity, variety, veracity, value and overall complexity of data sets in a data ecosystem. Understanding these V words provides useful insight into the nature of Big Data.
There are many definitions of Big Data in industry and academia; however, the most succinct yet comprehensive definition, and one I agree with, comes from Gartner: "Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making".
The only missing keyword in this definition is 'veracity'. I'd also add to this definition that these characteristics are interrelated and interdependent.
Introduction to Big Data Lifecycle Management
As Big Data solution architects, we need to understand the data lifecycle management process because we are engaged in all phases of the lifecycle as technical leaders.
Our roles and responsibilities may differ from phase to phase; however, we need to stay on top of lifecycle management from an end-to-end perspective.
From an architectural and solution design perspective, a typical Big Data solution, like a traditional data lifecycle, can include a dozen distinct phases in the overall data lifecycle process.
Big Data solution architects are engaged in all phases of the lifecycle, providing different input and producing different output for each phase.
These phases may be implemented under various names in different data solution teams.
There is no rigorous universal systemic approach to the Big Data lifecycle in the industry as the field is still evolving.
The common approach is that experience from traditional data management is transferred and enhanced for particular solution use cases.
For awareness and as guidance for aspiring Big Data architects, I propose the following distinct phases. I have used this template in several successful data architecture projects to ensure the lifecycle was properly covered in the solutions.
Phase 1: Foundations
Phase 2: Acquisition
Phase 3: Preparation
Phase 4: Input and Access
Phase 5: Processing
Phase 6: Output and Interpretation
Phase 7: Storage
Phase 8: Integration
Phase 9: Analytics and Visualisation
Phase 10: Consumption
Phase 11: Retention, Backup, and Archival
Phase 12: Destruction
Let me provide you with an overview of each phase with some guiding points. You can customise the names of these phases based on the requirements and organisational data practice of your Big Data solutions.
The key point is that these names are not set in stone and are provided only as guidance.
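As a minimal sketch, the twelve phases listed above can be captured as an ordered enumeration so that plans, documents, and tools refer to them consistently. The names LifecyclePhase and next_phase are hypothetical and chosen only for illustration; rename the members to match your own data practice.

```python
from enum import IntEnum
from typing import Optional


class LifecyclePhase(IntEnum):
    """The twelve lifecycle phases, numbered in the order proposed above."""
    FOUNDATIONS = 1
    ACQUISITION = 2
    PREPARATION = 3
    INPUT_AND_ACCESS = 4
    PROCESSING = 5
    OUTPUT_AND_INTERPRETATION = 6
    STORAGE = 7
    INTEGRATION = 8
    ANALYTICS_AND_VISUALISATION = 9
    CONSUMPTION = 10
    RETENTION_BACKUP_AND_ARCHIVAL = 11
    DESTRUCTION = 12


def next_phase(phase: LifecyclePhase) -> Optional[LifecyclePhase]:
    """Return the phase that follows, or None after the final phase."""
    if phase is LifecyclePhase.DESTRUCTION:
        return None
    return LifecyclePhase(phase + 1)
```

An integer-based enumeration keeps the chronological order explicit while still allowing the member names to be customised.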
Phase 1: Foundations
In the data management process, the foundation phase includes various aspects such as understanding and validating data requirements, solution scope, roles and responsibilities, data infrastructure preparation, technical and non-technical considerations, and understanding data rules in an organisation.
This phase requires a detailed plan facilitated ideally by a data solution project manager with substantial input from the Big Data solution architect and some data domain specialists.
A Big Data solution project includes details such as plans, funding, commercials, resourcing, risks, assumptions, issues, and dependencies in a project definition report (PDR). Project Managers compile and author the PDR; however, the solution overview in this critical artifact is provided by the Big Data Architect.
Phase 2: Data Acquisition
Data Acquisition refers to collecting data. Data sets can be obtained from various sources. These sources can be internal or external to the business organisation.
Data sources can be structured, such as data transferred from a data warehouse, a data mart, or various transaction systems; semi-structured, such as weblogs and system logs; or unstructured, such as media files consisting of video, audio, and pictures.
Even though data collection is conducted by various data specialists and database administrators, the Big Data architect has a substantial role in facilitating this phase optimally.
For example, data governance, security, privacy, and quality controls start with the data collection phase. Therefore, the Big Data architects take technical and architectural leadership of this phase.
The lead Big Data solution architect, in liaison with enterprise and business architects, leads and documents the data collection strategy, user requirements, architectural decisions, use cases, and technical specifications in this phase.
For comprehensive solutions of sizable business organisations, the lead Big Data architect can delegate some of these activities to various domain architects and data specialists.
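When documenting the data collection strategy, it can help to keep an inventory of sources grouped by how structured they are. The following is only a sketch: the catalogue keys, the SourceKind enumeration, the group_sources helper, and the sample source names are all hypothetical and simply mirror the examples given in this phase.

```python
from enum import Enum


class SourceKind(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


# Hypothetical catalogue: source type -> kind, following the examples in this phase.
SOURCE_CATALOGUE = {
    "data_warehouse": SourceKind.STRUCTURED,
    "data_mart": SourceKind.STRUCTURED,
    "transaction_system": SourceKind.STRUCTURED,
    "weblog": SourceKind.SEMI_STRUCTURED,
    "system_log": SourceKind.SEMI_STRUCTURED,
    "video": SourceKind.UNSTRUCTURED,
    "audio": SourceKind.UNSTRUCTURED,
    "picture": SourceKind.UNSTRUCTURED,
}


def group_sources(sources):
    """Group (name, source_type) pairs by kind; unknown types raise KeyError."""
    grouped = {kind: [] for kind in SourceKind}
    for name, source_type in sources:
        grouped[SOURCE_CATALOGUE[source_type]].append(name)
    return grouped


# Made-up inventory of three sources, one per kind.
inventory = [
    ("sales_dw", "data_warehouse"),
    ("nginx_access", "weblog"),
    ("call_recordings", "audio"),
]
grouped = group_sources(inventory)
```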
Phase 3: Data Preparation
In the data preparation phase, the collected data, in raw format, is cleaned or cleansed. These two terms are used interchangeably in the data practices of various business organisations.
In this phase, data is rigorously checked for inconsistencies, errors, and duplicates. Redundant, duplicated, incomplete, and incorrect data are removed. The objective is to have clean and useable data sets.
The Big Data solution architect facilitates this phase. However, because of their granularity, most data cleaning tasks can be performed by data specialists who are trained in data preparation and cleaning techniques.
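The checks described in this phase can be sketched in a few lines of plain Python. This is an illustration rather than a production cleaning pipeline: the clean_records and amount_is_valid functions, the field names, and the validity rule (a non-negative numeric amount) are assumptions made up for the example.

```python
def amount_is_valid(record):
    """Hypothetical correctness rule: amount must be a non-negative number."""
    amount = record["amount"]
    return isinstance(amount, (int, float)) and amount >= 0


def clean_records(records, required_fields, is_valid):
    """Drop incomplete, incorrect, and duplicate records, keeping first occurrences."""
    seen = set()
    cleaned = []
    for record in records:
        # Incomplete: a required field is missing or empty.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # Incorrect: the record fails the supplied validity rule.
        if not is_valid(record):
            continue
        # Duplicate: same values for all required fields as an earlier record.
        key = tuple(record[field] for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw = [
    {"id": 1, "customer": "Ada", "amount": 120.0},
    {"id": 1, "customer": "Ada", "amount": 120.0},  # duplicate
    {"id": 2, "customer": "", "amount": 75.5},      # incomplete
    {"id": 3, "customer": "Lin", "amount": -10},    # incorrect
    {"id": 4, "customer": "Sam", "amount": 42},
]
clean = clean_records(raw, ("id", "customer", "amount"), amount_is_valid)
```

Passing the validity rule as a function keeps the generic checks (completeness and duplicates) separate from the business-specific ones.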
Phase 4: Data Input and Access
Data input refers to sending data to planned target data repositories, systems, or applications.
For example, we can send the clean data to determined destinations such as a CRM (Customer Relationship Management) application, a data lake for data scientists, or a data warehouse for use by specific departments. In this phase, data specialists transform the raw data into a useable format.
Data access refers to accessing data using various methods. These methods can include the use of relational databases, flat files, or NoSQL stores. NoSQL is more relevant to, and more widely used for, Big Data solutions in various business organisations.
Even though the Big Data solution architect leads this phase, they usually delegate the detailed activities to data specialists and database administrators who can fulfil the input and access requirements.
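To make the three access methods mentioned in this phase concrete, here is a small sketch using only the Python standard library. It assumes an in-memory SQLite database as a stand-in for a relational database, CSV text as the flat file, and a plain dictionary as a stand-in for a key-value NoSQL store; the table, column, and variable names are hypothetical.

```python
import csv
import io
import sqlite3

rows = [("c1", "Ada", "AU"), ("c2", "Lin", "SG")]

# Relational access: load rows into an in-memory SQLite table and query with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
relational_result = conn.execute(
    "SELECT name FROM customers WHERE country = ?", ("SG",)
).fetchall()
conn.close()

# Flat-file access: write the same rows as CSV text and scan them row by row.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)
flat_file_result = [row[1] for row in csv.reader(buffer) if row[2] == "SG"]

# Key-value access, the simplest NoSQL model: look a record up directly by key.
key_value_store = {cid: {"name": name, "country": country} for cid, name, country in rows}
key_value_result = key_value_store["c2"]["name"]
```

All three approaches return the same customer; the difference lies in how the data is organised and queried, which is what drives the choice for a given solution.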
Phase 5: Data Processing
The data processing phase starts with processing the raw form of data. Then, we convert the data into a readable format, giving it form and context. After completion of this activity, we can interpret the data using the selected data analytics tools in our business organisation.
We can use common Big Data processing tools such as Hadoop MapReduce, Impala, Hive, Pig, and Spark SQL.
In most of my solutions, the popular real-time data processing tool was HBase, and the near real-time data processing tool was Spark Streaming. There are many other open-source and proprietary tools on the market.
Data processing also includes activities such as data annotation, data integration, data aggregation, and data representation.
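The aggregation activity can be illustrated with the map, shuffle, and reduce pattern that tools such as Hadoop MapReduce are built around. The sketch below is plain Python running in a single process, not the Hadoop API; the function names and the sample log lines are made up for the example.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


log_lines = ["error disk full", "warning disk slow", "error network down"]
counts = reduce_phase(shuffle_phase(map_phase(log_lines)))
```

In a real cluster the map and reduce steps run in parallel across many nodes; the logic of each step stays as simple as shown here.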
Phase 6: Data Output and Interpretation
In the data output phase, the data is in a format which is ready for consumption by the business users. We can transform data into usable formats such as plain text, graphs, processed images, or video files.
The output phase declares the data ready for use and sends it to the next phase for storing. In some data practices and business organisations, this phase is also called data ingestion. For example, the data ingestion process aims to import data for immediate or future use, or to keep it in a database format.
The data ingestion process can run in real time or in batches. Some standard Big Data ingestion tools commonly used in my solutions were Sqoop, Flume, and Spark Streaming. These are popular open-source tools.
One of the activities in this phase is to interpret the ingested data. This activity requires analysing the ingested data and extracting information or meaning from it to answer the questions related to the Big Data business solutions.
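The difference between batch and real-time ingestion mentioned in this phase can be sketched with two small generators. This is a conceptual illustration in plain Python, not the behaviour of Sqoop, Flume, or Spark Streaming; the function names and the event records are hypothetical.

```python
from typing import Iterable, Iterator, List


def ingest_batch(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Batch ingestion: collect records and hand them over in fixed-size groups."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def ingest_stream(records: Iterable[dict]) -> Iterator[dict]:
    """Real-time style ingestion: forward each record as soon as it arrives."""
    for record in records:
        yield record


events = [{"event_id": n} for n in range(5)]
batches = list(ingest_batch(events, batch_size=2))
streamed = list(ingest_stream(events))
```

Batching trades latency for throughput: downstream systems receive fewer, larger deliveries, while streaming delivers each record with minimal delay.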
Phase 7: Data Storage
Once we complete the data output phase, we store data in designed and designated storage units. These units are part of the data platform and infrastructure design considering all non-functional architectural aspects such as capacity, scalability, security, compliance, performance and availability.
The infrastructure can consist of storage area networks (SAN), network-attached storage (NAS), or direct-attached storage (DAS). Data and database administrators can manage stored data and allow access to the defined user groups.
Big Data storage can include underlying technologies such as database clusters, relational data storage, or extended data storage, e.g. HDFS and HBase, which are open-source systems.
In addition, file formats such as text, binary, or specialised formats such as SequenceFile, Avro, and Parquet must be considered in the data storage design phase.
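One reason the file format matters in storage design is the layout of the data: Avro stores data row by row, while Parquet stores it column by column. The sketch below uses plain Python structures to show the two layouts; it does not read or write either format, and the sensor records and the to_columns helper are hypothetical.

```python
rows = [
    {"sensor": "s1", "temperature": 21.5, "humidity": 40},
    {"sensor": "s2", "temperature": 19.0, "humidity": 55},
    {"sensor": "s3", "temperature": 22.5, "humidity": 35},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field (a columnar layout)."""
    columns = {field: [] for field in records[0]}
    for record in records:
        for field, value in record.items():
            columns[field].append(value)
    return columns


columns = to_columns(rows)

# Row layout: reading one whole record is a single lookup.
second_record = rows[1]

# Columnar layout: an aggregate over one field touches only that field's values.
average_temperature = sum(columns["temperature"]) / len(columns["temperature"])
```

Row-oriented formats tend to suit workloads that write and read whole records, while columnar formats tend to suit analytical queries that scan a few fields across many records.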
Phase 8: Data Integration
In traditional models, the data management process ends once the data is stored. However, for Big Data, there may be a need to integrate stored data with different systems for various purposes.
Data integration is a complex and essential architectural consideration in the Big Data solution process. Big Data architects are engaged to architect and design the use of various data connectors for the integration of Big Data solutions.
There may be use cases and requirements for many connectors such as ODBC, JDBC, Kafka, DB2, Amazon S3, Netezza, Teradata, Oracle and many more based on the data sources used in the solution.
Some data models may require integration of data lakes with a data warehouse or data marts. There may also be application integration requirements for Big Data solutions.
For example, some integration activities may comprise integrating Big Data with dashboards, Tableau, websites, or various data visualisation applications. This activity may overlap with the next phase, which is data analytics.
Phase 9: Data Analytics and Visualisation
Integrated data can be useful and productive for data analytics and visualisation.
Data analytics is a significant component of the Big Data management process. This phase is critical because this is where business value is gained from Big Data solutions. Data visualisation is one of the key functions of this phase.
We can use many productivity tools for analytics and visualisation based on the requirements of the solution. In my Big Data solutions, the most commonly used tools were Scala, Python, and R notebooks. Python was selected as the most productive tool because it touches almost all aspects of data analytics, especially in empowering machine learning initiatives.
In your business organisation, there can be a team responsible for data analytics led by a chief data scientist. Big Data solution architects have a limited role in this phase; however, they work closely with the data scientists to ensure the analytics practice and platforms are aligned with business goals.
The Big Data solution architects need to ensure the phases of the lifecycle are completed with architectural rigour.
Phase 10: Data Consumption
Once data analytics takes place, the data is turned into information ready for consumption by internal or external users, including customers of the business organisation.
Data consumption requires architectural input for policies, rules, regulations, principles, and guidelines. For example, data consumption can be based on a service provision process. Data governance bodies create regulations for service provision.
The lead Big Data Solution Architect leads and facilitates the creation of these policies, rules, principles, and guidelines using an architectural framework selected in the business organisation.
Phase 11: Retention, Backup, and Archival
We know that critical data must be backed up for protection and to meet industry compliance requirements.
We need to use established data backup strategies, techniques, methods, and tools. The Big Data solution architect must identify, document, and obtain approval for the retention, backup, and archival decisions.
The Big Data solution architect may delegate the detailed design of this phase to an infrastructure architect assisted by several data, database, storage, and recovery domain specialists.
Some data may need to be archived for a defined period of time for regulatory or other business reasons. The data retention strategy must be documented and approved by the governing body, especially by enterprise architects, and implemented by the infrastructure architects and the storage specialists.
Phase 12: Data Destruction
There may be regulatory requirements to destroy a particular type of data after a certain amount of time.
The destruction requirements may vary by industry.
You need to confirm the destruction requirements with the data governance team in your business organisation.
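Once the retention periods are agreed, identifying data that is due for destruction becomes a simple date calculation. The sketch below assumes made-up retention periods and record categories; the real values must come from the data governance team and the applicable regulations, and the RETENTION_DAYS table and due_for_destruction function are hypothetical names.

```python
from datetime import date, timedelta

# Hypothetical retention periods in days per data category; real values come
# from the data governance team and the applicable regulations.
RETENTION_DAYS = {"transaction": 7 * 365, "web_log": 90}


def due_for_destruction(records, today):
    """Return ids of records whose retention period has fully elapsed."""
    due = []
    for record in records:
        expiry = record["created"] + timedelta(days=RETENTION_DAYS[record["category"]])
        if today >= expiry:
            due.append(record["id"])
    return due


records = [
    {"id": "t-1", "category": "transaction", "created": date(2015, 1, 1)},
    {"id": "w-1", "category": "web_log", "created": date(2024, 1, 1)},
    {"id": "w-2", "category": "web_log", "created": date(2024, 5, 1)},
]
due = due_for_destruction(records, today=date(2024, 6, 1))
```

Passing the current date as a parameter rather than reading the system clock makes the rule easy to test and to audit.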
Even though the lifecycle has a chronological order, some phases may slightly overlap and can be done in parallel when producing Big Data solutions. Your organisation's proprietary method may require a certain order. You need to check with your method exponent in your organisation's data practice division.
The lifecycle proposed in this article is only a guideline for awareness of the overall process. You can customise the process based on the structure of the data solution team, unique organisational data platforms, data solution requirements, use cases, and dynamics of the owner organisation, its departments, or the overall enterprise ecosystem.
This has been a quick overview of Big Data lifecycle management using twelve phases. In my upcoming articles, I will introduce the Big Data solution components in further detail.
Thank you for reading my perspectives.
The original version of this story was published on another platform under a different title.