Data engineering

Last updated

Data engineering refers to the building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science; which often involves machine learning. [1] [2] Making the data usable usually involves substantial compute and storage, as well as data processing.

Contents

History

Around the 1970s/1980s the term information engineering methodology (IEM) was created to describe database design and the use of software for data analysis and processing. [3] [4] These techniques were intended to be used by database administrators (DBAs) and by systems analysts based upon an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems. A key early contributor (often called the "father" of information engineering methodology) was the Australian Clive Finkelstein, who wrote several articles about it between 1976 and 1980, and also co-authored an influential Savant Institute report on it with James Martin. [5] [6] [7] Over the next few years, Finkelstein continued work in a more business-driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data processing-driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role in revamping IEM as well as helping to design the IEM software product (user data), which helped automate IEM.

In the early 2000s, the data and data tooling was generally held by the information technology (IT) teams in most companies. [8] Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business.

In the early 2010s, with the rise of the internet, the massive increase in data volumes, velocity, and variety led to the term big data to describe the data itself, and data-driven tech companies like Facebook and Airbnb started using the phrase data engineer. [3] [8] Due to the new scale of the data, major firms like Google, Facebook, Amazon, Apple, Microsoft, and Netflix started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of software engineering focused on data, and in particular infrastructure, warehousing, data protection, cybersecurity, mining, modelling, processing, and metadata management. [3] [8] This change in approach was particularly focused on cloud computing. [8] Data started to be handled and used by many parts of the business, such as sales and marketing, and not just IT. [8]

Tools

Compute

High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is dataflow programming, in which the computation is represented as a directed graph (dataflow graph); nodes are the operations, and edges represent the flow of data. [9] Popular implementations include Apache Spark, and the deep learning specific TensorFlow. [9] [10] [11] More recent implementations such as Differential/Timely Dataflow have used incremental computing for much more efficient data processing. [9] [12] [13]

Storage

Data is stored in a variety of ways, one of the key deciding factors is in how the data will be used. Data engineers optimize data storage and processing systems to reduce costs. They use data compression, partitioning, and archiving.

Databases

If the data is structured and some form of online transaction processing is required, then databases are generally used. [14] Originally mostly relational databases were used, with strong ACID transaction correctness guarantees; most relational databases use SQL for their queries. However, with the growth of data in the 2010s, NoSQL databases have also become popular since they horizontally scaled more easily than relational databases by giving up the ACID transaction guarantees, as well as reducing the object-relational impedance mismatch. [15] More recently, NewSQL databases — which attempt to allow horizontal scaling while retaining ACID guarantees — have become popular. [16] [17] [18] [19]

Data warehouses

If the data is structured and online analytical processing is required (but not online transaction processing), then data warehouses are a main choice. [20] They enable data analysis, mining, and artificial intelligence on a much larger scale than databases can allow, [20] and indeed data often flow from databases into data warehouses. [21] Business analysts, data engineers, and data scientists can access data warehouses using tools such as SQL or business intelligence software. [21]

Data lakes

A data lake is a centralized repository for storing, processing, and securing large volumes of data. A data lake can contain structured data from relational databases, semi-structured data, unstructured data, and binary data. A data lake can be created on premises or in a cloud-based environment using the services from public cloud vendors such as Amazon, Microsoft, or Google.

Files

If the data is less structured, then often they are just stored as files. There are several options:

Management

The number and variety of different data processes and storage locations can become overwhelming for users. This inspired the usage of a workflow management system (e.g. Airflow) to allow the data tasks to be specified, created, and monitored. [24] The tasks are often specified as a directed acyclic graph (DAG). [24]

Lifecycle

Business planning

Business objectives that executives set for what's to come are characterized in key business plans, with their more noteworthy definition in tactical business plans and implementation in operational business plans. Most businesses today recognize the fundamental need to grow a business plan that follows this strategy. It is often difficult to implement these plans because of the lack of transparency at the tactical and operational degrees of organizations. This kind of planning requires feedback to allow for early correction of problems that are due to miscommunication and misinterpretation of the business plan.

Systems design

The design of data systems involves several components such as architecting data platforms, and designing data stores. [25] [26]

Data modeling

This is the process of producing a data model, an abstract model to describe the data and relationships between different parts of the data. [27]

Roles

Data engineer

A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into insights. [28] They are focused on the production readiness of data and things like formats, resilience, scaling, and security. Data engineers usually hail from a software engineering background and are proficient in programming languages like Java, Python, Scala, and Rust. [29] [3] They will be more familiar with databases, architecture, cloud computing, and Agile software development. [3]

Data scientist

Data scientists are more focused on the analysis of the data, they will be more familiar with mathematics, algorithms, statistics, and machine learning. [3]

See also

Related Research Articles

<span class="mw-page-title-main">Database</span> Organized collection of data in computing

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

<span class="mw-page-title-main">IBM Db2</span> Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB2 until 2017, when it changed to its present form.

Computer science is the study of the theoretical foundations of information and computation and their implementation and application in computer systems. One well known subject classification system for computer science is the ACM Computing Classification System devised by the Association for Computing Machinery.

Online analytical processing, or OLAP, is an approach to answer multi-dimensional analytical (MDA) queries swiftly in computing. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

<span class="mw-page-title-main">Extract, transform, load</span> Procedure in computing

In computing, extract, transform, load (ETL) is a three-phase process where data is extracted from an input source, transformed, and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on reccurring schedules either as single jobs or aggregated into a batch of jobs.

Clive Finkelstein is an Australian computer scientist, known as the "Father" of information engineering methodology.

In computing, the term data warehouse appliance (DWA) was coined by Foster Hinshaw for a computer architecture for data warehouses (DW) specifically marketed for big data analysis and discovery that is simple to use and has a high performance for the workload. A DWA includes an integrated set of servers, storage, operating systems, and databases.

<span class="mw-page-title-main">Microsoft Azure SQL Database</span> Managed cloud database

Microsoft Azure SQL Database is a managed cloud database (PaaS) cloud-based Microsoft SQL Servers, provided as part of Microsoft Azure services. The service handles database management functions for cloud based Microsoft SQL Servers including upgrading, patching, backups, and monitoring without user involvement.

Infobright is a commercial provider of column-oriented relational database software with a focus in machine-generated data. The company's head office is located in Toronto, Ontario, Canada. Most of its research and development is based in Warsaw, Poland. Support personnel are located in various offices around the world.

SQLstream is a distributed, SQL standards-compliant plus Java stream processing platform. SQLstream, Inc. is based in San Francisco, California and was launched in 2009 by Damian Black, Edan Kabatchnik and Julian Hyde, author of the open source Mondrian Relational OLAP Server Engine.

A cloud database is a database that typically runs on a cloud computing platform and access to the database is provided as-a-service. There are two common deployment models: users can run databases on the cloud independently, using a virtual machine image, or they can purchase access to a database service, maintained by a cloud database provider. Of the databases available on the cloud, some are SQL-based and some use a NoSQL data model.

The term is used for two different things:

  1. In computer science, in-memory processing (PIM) is a computer architecture in which data operations are available directly on the data memory, rather than having to be transferred to CPU registers first. This may improve the power usage and performance of moving data between the processor and the main memory.
  2. In software engineering, in-memory processing is a software architecture where a database is kept entirely in random-access memory (RAM) or flash memory so that usual accesses, in particular read or query operations, do not require access to disk storage. This may allow faster data operations such as "joins", and faster reporting and decision-making in business.

The following is provided as an overview of and topical guide to databases:

<span class="mw-page-title-main">SingleStore</span> Database management system

SingleStore is a proprietary, cloud-native database designed for data-intensive applications. A distributed, relational, SQL database management system (RDBMS) that features ANSI SQL support, it is known for speed in data ingest, transaction processing, and query processing.

<span class="mw-page-title-main">Apache Drill</span> Open-source software framework

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Drill is an Apache top-level project. Tom Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016.

Amazon Redshift is a data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services. It is built on top of technology from the massive parallel processing (MPP) data warehouse company ParAccel, to handle large scale data sets and database migrations. Redshift differs from Amazon's other hosted database offering, Amazon RDS, in its ability to handle analytic workloads on big data data sets stored by a column-oriented DBMS principle. Redshift allows up to 16 petabytes of data on a cluster compared to Amazon RDS Aurora's maximum size of 128 terabytes.

<span class="mw-page-title-main">Actian</span> American software company

Actian is an American software company headquartered in Santa Clara, California that provides analytics-related software, products, and services. The company sells database software and technology, cloud engineered systems, and data integration solutions.

<span class="mw-page-title-main">Apache Flink</span> Framework and distributed processing engine

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.

Presto is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata, and allows use of multiple data sources within a query. Presto is community-driven open-source software released under the Apache License.

References

  1. "What is Data Engineering? | A Quick Glance of Data Engineering". EDUCBA. January 5, 2020. Retrieved July 31, 2022.
  2. "Introduction to Data Engineering". Dremio. Retrieved July 31, 2022.
  3. 1 2 3 4 5 6 Black, Nathan (January 15, 2020). "What is Data Engineering and Why Is It So Important?". QuantHub. Retrieved July 31, 2022.
  4. "Information Engineering - an overview | ScienceDirect Topics". www.sciencedirect.com. Retrieved August 23, 2022.
  5. "Information engineering," part 3, part 4, part 5, Part 6" by Clive Finkelstein. In Computerworld, In depths, appendix. May 25 – June 15, 1981.
  6. Christopher Allen, Simon Chatwin, Catherine Creary (2003). Introduction to Relational Databases and SQL Programming.
  7. Terry Halpin, Tony Morgan (2010). Information Modeling and Relational Databases. p. 343
  8. 1 2 3 4 5 Dodds, Eric. "The History of the Data Engineering and the Megatrends". Rudderstack. Retrieved July 31, 2022.
  9. 1 2 3 Schwarzkopf, Malte (March 7, 2020). "The Remarkable Utility of Dataflow Computing". ACM SIGOPS. Retrieved July 31, 2022.
  10. "sparkpaper" (PDF). Retrieved July 31, 2022.
  11. Abadi, Martin; Barham, Paul; Chen, Jianmin; Chen, Zhifeng; Davis, Andy; Dean, Jeffrey; Devin, Matthieu; Ghemawat, Sanjay; Irving, Geoffrey; Isard, Michael; Kudlur, Manjunath; Levenberg, Josh; Monga, Rajat; Moore, Sherry; Murray, Derek G.; Steiner, Benoit; Tucker, Paul; Vasudevan, Vijay; Warden, Pete; Wicke, Martin; Yu, Yuan; Zheng, Xiaoqiang (2016). "TensorFlow: A system for large-scale machine learning". 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp. 265–283. Retrieved July 31, 2022.
  12. McSherry, Frank; Murray, Derek; Isaacs, Rebecca; Isard, Michael (January 5, 2013). "Differential dataflow". Microsoft . Retrieved July 31, 2022.
  13. "Differential Dataflow". Timely Dataflow. July 30, 2022. Retrieved July 31, 2022.
  14. "Lecture Notes | Database Systems | Electrical Engineering and Computer Science | MIT OpenCourseWare". ocw.mit.edu. Retrieved July 31, 2022.
  15. Leavitt, Neal (2010). "Will NoSQL Databases Live Up to Their Promise?" (PDF). IEEE Computer . 43 (2): 12–14. doi:10.1109/MC.2010.58. S2CID   26876882.
  16. Aslett, Matthew (2011). "How Will The Database Incumbents Respond To NoSQL And NewSQL?" (PDF). 451 Group (published April 4, 2011). Retrieved February 22, 2020.
  17. Pavlo, Andrew; Aslett, Matthew (2016). "What's Really New with NewSQL?" (PDF). SIGMOD Record. Retrieved February 22, 2020.
  18. Stonebraker, Michael (June 16, 2011). "NewSQL: An Alternative to NoSQL and Old SQL for New OLTP Apps". Communications of the ACM Blog. Retrieved February 22, 2020.
  19. Hoff, Todd (September 24, 2012). "Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In" . Retrieved February 22, 2020.
  20. 1 2 "What is a Data Warehouse?". www.ibm.com. Retrieved July 31, 2022.
  21. 1 2 "What is a Data Warehouse? | Key Concepts | Amazon Web Services". Amazon Web Services, Inc. Retrieved July 31, 2022.
  22. 1 2 3 "File storage, block storage, or object storage?". www.redhat.com. Retrieved July 31, 2022.
  23. "Cloud Object Storage – Amazon S3 – Amazon Web Services". Amazon Web Services, Inc. Retrieved July 31, 2022.
  24. 1 2 "Home". Apache Airflow. Retrieved July 31, 2022.
  25. "Introduction to Data Engineering". Coursera. Retrieved July 31, 2022.
  26. Finkelstein, Clive. What are The Phases of Information Engineering.
  27. "What is Data Modelling? Overview, Basic Concepts, and Types in Detail". Simplilearn.com. June 15, 2021. Retrieved July 31, 2022.
  28. Tamir, Mike; Miller, Steven; Gagliardi, Alessandro (December 11, 2015). "The Data Engineer". Rochester, NY. doi:10.2139/ssrn.2762013. S2CID   113342650. SSRN   2762013.{{cite journal}}: Cite journal requires |journal= (help)
  29. "Data Engineer vs. Data Scientist". Springboard Blog. February 7, 2019. Retrieved March 14, 2021.

Further reading