By:Nai Biao Zhou|Updated: 2023-05-04 |Comments| Related:Mero>data storage
Problem
According to Gartner, small and medium-sized businesses (SMBs) typically have fewer than 999 employees. IT resources, such as budget and staff, are often very limited (Gartner, 2023). Ideally, a data warehouse project team should be made up of professionals from people from different departments with different skills to take on different roles. However, it is sometimes not practical for some SMEs to form such a project team. These SMBs using Microsoft products are looking for a data warehouse architecture that allows some software developers to create and maintain a data warehouse.
Solution
A data warehouse architecture describes a flow of data from data sources to business users (Kimball et al., 2008). Due to the limitations of SMBs, we want to design a data warehouse architecture that allows us to build the data warehouse incrementally. In this way, we can make a comprehensive plan for a data warehouse and build iterations. During each iteration, we add enough features, no more, no less, to satisfy users while continuing to follow the data warehouse roadmap. Meanwhile, data warehouse architecture evolves along with changes in business, technologies, and team members. Therefore, all components of the architecture must be replaceable. After many iterations, we will have a mature and well-defined data warehouse.
Drawing inspiration from Pandya and Shah's local data mart approach (Pandya & Shah, 2014) and reporting operational data warehouses (Kimball et al., 2008), we adopt a hybrid of the standalone data mart (IDM) and bus architecture. from dating mart.Figure 1displays a high-level overview of the logical architecture. The logic model specifies that business users can access data landing areas and application data marts. At the same time, they can view subject and company level data from integrated data marts. The architectural plan is not set in stone. We must adapt to the changes in the world and review the architecture plan frequently.
Figure 1A logical model of high-level data warehouse technical architecture
The logic model does not attempt to reflect the meaning of the component; instead, the model illustrates the arrangement of these components. The proposed logic model contains eight components:
(1)Data transfer services:Transfer files from data sources (including production data, internal data, archived data, and external data) to the data storage environment.
(2)Data Landing Area:It acts as the first delivery point for incoming data (Schraml, 2020).
(3)Data extraction services:Denormalize the data in the data landing pad, and then load it into a relational database.
(4)Application data marts:Contains the denormalized data at the application level.
(5)ELT Services:Perform data extraction, loading, and transformation to produce integrated topic-driven data.
(6)Integrated data marts:Contains the topic-oriented data.
(7)Security:Proactively protect information from unauthorized access.
(8)Metadata:It acts as a table of contents for the data in the data store, helping users understand the content and find data.
A data warehouse project is human intensive (Ponnian, 2001). Due to a lack of IT resources, we started with a simple architecture and let the company manage the data warehouse. Eventually, the data warehouse matures and can consolidate relevant data from multiple sources. Although we use Kimball's approach to design the architecture, there are some differences between Kimball's approach and ours:
(1) The Kimball approach actively interviews business users to collect rules. Rather, our approach passively waits for business users to tell us what they need.
(2) Kimball's approach starts with a single business process, whereas ours starts with a single business process in a working application.
(3) When loading data into application data marts, we most likely do not perform data cleansing and transformation because the business rules may be unknown. However, business users can analyze this data because self-service BI tools can efficiently process data today.
(4) We conditionally allow commercial users to access the "back room", that is, the data landing area and the data marketplace of the application. Rather, Kimball's approach allows only business users to access the "front room" presentation area.
Each business has specific requirements and resources; therefore, the data warehouse architecture varies from company to company. In this tip, we first briefly introduce the process and major components of creating a data warehouse. The article then explores five predominant architectures. As these architectures evolve, they become more similar (Ariyachandra & Watson, 2006). For example, the Kimball and Inmon approaches agree that the presentation layer data should be in the dimensional model (George & Jeyakumar, 2019).
1 - Construction of a data warehouse
Software developers in SMBs may have limited knowledge and experience building a data warehouse. Therefore, when launching a data warehousing project, they must undergo a radical change in their roles and adjust the way they think about building applications. For example, they may be used to developing applications or reports to provide information to business users. In contrast, the data warehouse project asks them to create an information delivery system that allows business users to access data independently (Ponnian, 2001). To build a successful data warehouse, they must clearly understand that both users and business drive development, and they must be humble in the presence of business users (Kimball et al., 2008).
We can use Kimball's "plan big, build small" approach when working on a data warehousing project. Therefore, the initial project implements only a small part of the data warehouse project. Rather than reinventing the wheel, we use the Kimball Lifecycle Framework to guide each step in building a successful data warehouse. The framework provides a clear picture of the data warehouse development lifecycle.
1.1 Planning the data warehouse project
A data warehouse project begins with the preparation of a project plan. First we need to assess readiness for the data warehouse project. We need to make sure that the company needs a data warehouse and that senior management has realized that a data warehouse can help solve business problems. Otherwise, a data warehousing project is likely to fail without commercial backers. Furthermore, a culture of data-driven decision-making in the company also leads to building a successful data warehouse.
Even if we want to build a large data warehouse for the entire enterprise, we need to implement a small part in the initial iteration. In this case, we have a clear objective and realistic expectations. For example, the initial data warehouse is intended to create a few application data marts. With this project scope, we can staff the project, estimate the project schedule, and get approval from senior management.
1.2 Gathering business requirements
Business users and requirements, not developers and technology, drive the data warehouse project. Therefore, developers must resist the temptation to build smart components that tap into the depths of technology. Instead, they should focus on the information that business users need (Ponnian, 2001). Ideally, we should have precise business requirements prior to data warehouse design and implementation. However, the business world is changing rapidly. As a result, the way information is used in the data warehouse is unpredictable. Business users cannot view the data warehouse requirements until they use the data warehouse. On the other hand, an SMB data warehouse team may not be able to arrange many interviews with business users to qualify.
However, the business requirements are the main guide for the data warehouse project, and we must have a comprehensive understanding of the business to build a data warehouse. We can collect high-level business requirements from industry best practices. Internal company reports and documents are another valuable resource for discovering business requirements. In addition, operational data can help explain business processes and indicate some measures of company performance.
Furthermore, managers tend to view business performance metrics in terms of business dimensions (Ponnian, 2001). They know what numerical numbers can indicate business success. They then ask what factors affect these numbers. For example, a sales manager at Adventure Works Cycles (Microsoft, 2022) is interested in sales revenue by product category in a specific month and by promotion in a specific sales area. Product category, month, promotion, and sales area are business dimensions, and sales revenue is the numerical figure to measure sales performance. We can determine detailed business requirements for a data warehouse with dimensional analysis.
1.3 Data warehouse architecture
We use a data warehouse architecture, which represents the overall design and structure of a data warehouse, to explain how we implement business requirements. Business requirements are the primary guide for designing a data warehouse architecture. Since each company may have unique business requirements, the data warehouse architecture may vary from company to company. However, a data warehouse architecture must have the following characteristics (IntecrviewBit, 2023):
- Separation:Analytical and transactional processing should be suspended.
- Scalability:With the increasing volume of data and the number of users, data warehouses must be able to process the data with the least possible impact on the existing infrastructure.
- Expansion Options:The architecture had to be able to handle the changes in the world and bring new features to business users without redesigning the entire system.
- Security:Each component of the architecture must implement security strategies.
- Manageability:Data warehouse management should be easy.
There are many types of data storage architecture. However, we can classify them into two basic groups: the hub-and-spoke architecture and the data mart bus architecture. Bill Inmon and Ralph Kimball are pioneers in data warehouse development and advocate both of these architectural approaches. Both approaches have been equally successful in recent decades for their intended purposes (Rangarajan, 2016). However, when an enterprise is willing to adopt the "plan big, build small" strategy, we recommend using the datamart bus architecture, ie. Kimball's approach.
1.4 Design and implementation of the database
Business users can naturally think of business metrics in terms of business dimensions and analyze metrics along those dimensions. Therefore, a proposed data model should include these business dimensions and metrics. We can use the dimensional modeling technique to design the data model. A dimensional data model is structured to provide maximum query performance and ease of use. A typical dimensional model consists of a fact table surrounded by dimension tables. This data structure is often called a star schema (Zhou, 2020).
A data warehouse consolidates relevant data from different sources, and the original data can be in different formats. To simplify the design of the database, we classify the data into three categories (Moody & Kortink, 2003):
- Transaction units:Record the output of a single business process. The output contains metrics (for example, sales revenue) that measure performance. Units must also contain attributes (for example, sales area) that describe and define the process.
- Component Units:Describe and define the business process. They answer who, what, when, where, how, and why questions about the process.
- Classification units:Classify a component entity, e.g. ProductCategory and ProductSubcategory. These entities can define hierarchies in the data model to drill down or summarize in data analysis. In addition, they can categorize a component unit.
After classifying the units we can design the star scheme. We designate transaction entities as fact tables and component entities as dimension tables. We typically collapse class entities into component entities to get flat dimension tables with single partition keys that connect directly to the fact table. Fact tables must be in 3NF and dimension tables are usually in 2NF and possibly 3NF, but cannot be in 1NF. It is worth noting that the mapping of a transaction entity to a fact table is not always one-to-one, nor is the correspondence between component entities and dimensions always one-to-one.
We typically create conceptual, logical, and physical models when we design and implement a database. We first create a conceptual data model and then work on a logical one. Finally, we produce a physical data model after the logical data model is approved. However, in a data warehouse project, the conceptual and logical data models are sometimes considered a single product (Zhou 2018).
1.5 Design and implementation of the ELT
Copies of the raw data are in the data landing area, and the destination is the data mart, which has a collection of database tables in either 2NF or 3NF. Developers typically use the Extract, Transform, and Load (ETL) system to transport data from the landing pad to the data mart. According to Kimball, building an ETL system can take 70 percent of the time and effort of building a DW/BI environment (Kimballet al., 2008). In summary, an ETL system performs these three tasks sequentially:
- Extraction:Extract data from multiple sources, including structured and unstructured data, into a consistent, consolidated data structure.
- Transformation:Apply business rules and transform data into useful information. This task can include deduplication, standardization, aggregation, discretization, generalization, integration, and cleanup.
- Charging:Moves the data to a destination data store. Usually, there are two ways to upload: full update and incremental upload. We typically use the full update approach for payload component entities. This method truncates the database tables and then loads all the data. The incremental load method loads the delta (or difference) between the data source and the destination.
However, the architecture proposed in this document uses a different system, ie ELT or Extract, Load, Transform. We extract the data and load it into the warehouse immediately. The new set of data structures, which are not yet a dimensional model, add some capabilities to the original structures and provide value to business users for various analytical purposes. In this case, we don't transform the data until business users analyze the data and find exact data in specific structures. Then we have your business requirements and add features to the data transformation layer. In addition, today's self-service BI tools are powerful and can perform data transformation on the fly.
1.6 Implementation
We should have development, test, and acceptance environments to build a data warehouse. When we get to the implementation phase, we are done testing all the features that the data warehouse can provide. We should allow business users to run tests in the acceptance environment. They use all of their tools to perform data analysis and compare the analysis results of the operating systems with the results of the data warehouse. They can assess the complexity and performance of queries. We can only deploy the data warehouse to the production environment when business users are satisfied with the test results. We should invite users from different groups to test the data warehouse. While they test the data, we test the security. Data warehouses should only allow users to see the data they have access to.
A data warehouse is a combination of many technologies and components. The initial project must implement all the components; therefore, we adopt the divide and conquer strategy. We can package the deployment into manageable parts and deploy them independently with a deployment plan. In subsequent projects, we can use this plan as a checklist. We must also develop a backup and recovery plan to ensure disaster recovery readiness.
1.7 Maintenance
Our strategy is to plan big and build small. We expect business users to tell us what they need. If a data warehouse doesn't add value to business users, they will walk away and look for alternative solutions. Eventually, a data warehouse becomes a data warehouse. Therefore, maintaining a healthy data warehouse is critical to building a successful one. Hardware and software resources often provide statistics to show usage information. In addition, business users bring us their comments and demands. Therefore, collecting statistics and maintaining healthy relationships with business users is critical to maintaining a data warehouse. In addition, data access control is routine maintenance.
Future iterations will add new data marts to the data warehouse in future iterations. We will probably retire some data marts. Therefore, we must keep business users informed and help them update their inquiries accordingly. On the other hand, custom dimensions are a fundamental element of Kimball's approach. We can weave star schemes through custom dimensions - we should use custom dimensions whenever we can. When we integrate dimensions from multiple sources into a custom dimension, we must notify the development team and business users. We must keep track of changes to the data warehouse and provide the data warehouse community with updated documents.
2 – Comparison of data warehouse architecture
We usually use two classification methods to classify data storage architectures. The first method uses an N-tier architecture to describe a data storage architecture. This method divides a data store into logical layers and physical layers. For example, a data warehouse typically uses a three-tier architecture (Han et al., 2012). Some data warehouses use a five-tier Business Intelligence architecture (Ong et al., 2008).
The second classification method (adopted in this article) characterizes the arrangement of individual components, especially these data warehouses: central storage and data marts. There are five commonly recognized architectural styles: Standalone Data Marts, Data Mart Buses, Hub-and-Spoke, Centralized Data Warehouses, and Federated (Singh & Malhotra, 2011). There are other architectures, but they tend to be variations of these five. For example, the Data Vault Model (DVM), invented by Dan Linstedt, is based on hub-and-spoke (Inmon et al., 2019). It is worth noting that Bill Inmon favored the hub-and-spoke architecture (for example, Corporate InformationFactory), while Kimball favored the datamart bus architecture.
2.1 Independent data marts
A data mart is a subset of a data warehouse, typically consisting of a single subject area (for example, finance, human resources, and marketing). A standalone data mart is a small storage designed for individual devices or specific applications. We can create independent data marts at a lower price and in a shorter time. However, because of their independence, they may have inconsistent data and may not provide "a single version of the truth," making it difficult to analyze data at the enterprise level.Figure 2illustrates the standalone data mart architecture.
Figure 2 Independent Data Mart Architecture
2.2 Data center bus
Kimball created the data mart bus architecture in the 1990s by offering an innovative approach to building a data warehouse. This architecture, as shown infigure 3, is based on dimensional modeling, which involves these concepts: dimensions, facts, custom dimensions, and bus matrix. A fact table and dimensions build a star schema that forms a data mart. A union of all well-defined data marts builds a corporate data warehouse. The BUS (ie custom dimensions) facilitates communication between these data marts.
We call the approach to adopt this architecture bottom-up or user-driven. This architecture allows us to build a data warehouse incrementally. For example, we are building the first data mart for a single business process. Then we look at other processes and build more data marts. At each iteration, we go through a tuning process to create custom dimensions. Additional marts should use these custom dimensions, resulting in logically integrated marts and a business view of the data.
figure 3Data center bus architecture
2.3 Hub and spokes
The hub-and-spoke architecture, as shown inFigure 4, includes a centralized data warehouse (highly normalized and often in 3NF) and multi-departmental data marts. We physically separate the central data warehouse and the data mart. These data marts are dependent data marts because we create them from the centralized data warehouse. Bill Inmon, "the father of the data warehouse," advocated this architecture.
We call the approach to using this architecture the top-down or data-driven approach. We create a centralized data warehouse, collect data from sources, transform it, and load it into the warehouse without knowing any end-user criteria or requirements. divide data into data marts to meet the needs of different business groups.
Figure 4La arquitectura Hub-and-Spoke
2.4 Centralized data warehouse
The centralized data warehouse architecture, as shown inFigure 5, similar to the hub-and-spoke architecture, but without dependent data marts. Users get access to the giant enterprise data warehouse.
Figure 5The centralized data warehouse architecture
2.5 Connected
A large company may be made up of several regional departments, each of which has built a data warehouse. On the other hand, a company may have multiple data warehouses built for different functions or applications. In this case, we can use a federated architecture, as shown inFigure 6, to integrate at least some of the data. In a federated architecture, we don't change existing data stores; business users can access these repositories when needed. Meanwhile, we can use shared keys, global metadata, distributed queries, and other methods to logically or physically integrate data (Alsqour et al., 2012).
Figure 6federated architecture
The choice of data warehouse architecture has a critical impact on project planning and implementation (Dymek et al., 2015). If resources are limited, we can start with standalone data marts and then let business requirements drive architecture development. Standalone data marts can adequately serve business users if they just want to do descriptive analysis to respond to what happened. However, when users perform advanced analytics (such as diagnostic, predictive, and prescriptive analytics), they may need enterprise-grade data. The architecture proposed in this document is a hybrid of the data mart and standalone mart bus architectures, which can satisfy different levels of business analysis.
Resume
A data warehouse is a copy of transactional data structured specifically for query and analysis (Kimball & Ross, 1996). A data warehouse can help companies make decisions. Adopting the right architecture is critical to building a successful data warehouse. This article proposes an architecture for small and medium businesses with very limited IT resources. The proposed architecture allows us to implement Kimball's "plan big, build small" strategy. Furthermore, the architecture allows developers to implement a new data mart at a lower cost and in a shorter time.
We start with the proposed data storage architecture. Next, we compare our architecture with Kimball's architecture. We pointed out that we could start with a simple architecture and let the company drive the development of the data warehouse. Below, we explore seven important components of building a data warehouse. Below, we briefly cover the five predominant architectures. These architectures are not mutually exclusive. Our architecture is evolving and it is possible to use components from other architectures and arrange them in the same way as the others.
Reference
Alsqour, M., Matouk, K. y Owoc, Mieczyslaw. (2012).A Survey of Data Warehouse Architectures: Preliminary Results. Federated Computer Science and Information Systems Conference 2012, FedCSIS 2012. 1121-1126.
Ariyachandra, T. & Watson, JH (2006).Which data storage architecture is most successful?Business Intelligence Magazine 11(1): 4.
Dymek, D., Szwed, P. y Komnata, W. (2015).Proposal for a new data warehouse architecture reference model. https://www.researchgate.net/publication/277476655
Gartner (2023).Small and medium enterprises (SMEs). https://www.gartner.com/en/information-technology/glossary/smbs-small-and-midsize-businesses.
George , J. and Jeyakumar , K. M. (2019).A comparative study of data storage architectures: top-down and bottom-up. IJSTE, Volume 5, Number 9, 008.
Han, J., Kamber, M. y Pei, j. (2012).Data mining concepts and techniques. Waltham, MA: Morgan Kaufmann.
Inmon , HWH , Linstedt , D. and Levins , M. (2019).Data Architecture: A Basic Manual for the Data Scientist, 2. udgave. Cambridge, MA: Academic Press.
Interview Bit. (2023).Data Warehouse Architecture: Detailed Explanation. https://www.interviewbit.com/blog/data-warehouse-architecture/.
Kimball, R., Ross, M, Thornthwaite, Mundy, J. y Becker, B. (2008).Data Warehouse Lifecycle Toolkit. . . . Hoboken, New Jersey: John Wiley & Sons.
Kimball, R. y Ross, M. (1996).Data Warehouse Toolkit. Hoboken, Nueva Jersey: JohnWiley & Sons.
Microsoft.(2022).Analyze Services tutorial scenario. https://learn.microsoft.com/en-us/analysis-services/multidimensional-tutorial/analysis-services-tutorial-scenario.
Moody D. L. y Kortink M. A. R.,From ER models to dimensional models: bridging OLTP and OLAP design, Journal of Business Intelligence, enlace 8, 2003.
Ong, L.I., Siew, H.P. y Wong, F. S. (2011).A five-layer Business Intelligence architecture. vol. 2011 (2011), Artikel ID 695619, IBIMA Communications: IBIMA Editorial.
Pandya, B y Shah, S. (2014).Local data mart approach proposal for data warehouse architecture. IJETAE, Volume 4, Number 2.
Ponnian, P. (2001).Data Storage Fundamentals: A Comprehensive Guide for IT Professionals.Hoboken, Nueva Jersey: Wiley InterScience.
Rangarajan, s. (2016).Data Warehouse Design: Inmon vs. Kimball. https://tdan.com/data-warehouse-design-inmon-versus-kimball/20300
Schraml, T. (2020).Staging Areas vs. Landing Areas in Analytics Environments. https://www.dbta.com/Columns/Database-Elaborations/Staging-Versus-Landing-Areas-in-Analytics-Environments-143425.aspx
Singh, S. y Malhotra, S. (2011).Data warehouse and its methods. Journal of Global Research in Computer Science, link 2 nr. (5), May 2011.
Yessad, L. y Labiod, A. (2016).Comparative Study of Data Warehouse Modeling Approaches: Inmon, Kimballand Data Vault. 2016 International Conference on Systems Reliability and Science (ICSRS), Paris, Frankrig, 2016, s. 95-99, doi: 10.1109/ICSRS.2016.7815845.
Zhou, N. (2018).Create a star schema data model in SQL Server using the Microsoft toolset.https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/.
Zhou, N. (2020).Explore the role of normal shapes in dimensional modeling.https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/.
Next step
- A data warehouse is a combination of technologies and components. The article only covered some basics of building a data warehouse. The high-level logical data storage architecture model proposed in this article can guide the creation of a layered data storage architecture, including a detailed architecture design. The author recommends creating a simple architecture when a software developer needs to build a data warehouse. The developer can then read Ponniah's book "Data Warehousing Fundamentals for IT Professionals" to learn data warehousing principles. Then they can read Kimball's books, p. "The Data Warehouse Lifecycle Toolkit", "The DataWarehouse Toolkit", and "The Data Warehouse ETL Toolkit" in order.
- Check out these related tips:
- Create a star schema data model in SQL Server using the Microsoft toolset
- Explore the role of normal shapes in dimensional modeling
- What is a relational database management system?
- Create an extended date dimension for a SQL Server data store
- Loading historical data into a SQL Server data warehouse
- How to use fact tables without data in a data warehouse Business IntelligenceSolution
- Infrastructure planning for a SQL Server data warehouse
- Surrogate key vs. natural key differences and when to use in SQL Server
About the Author
Nai Biao Zhou is a senior software developer with more than 20 years of experience in software development, specializing in data warehousing, business intelligence, data mining, and solution architecture design.
See all my tips