Course Detail
Cloud and Big Data Management
Modern big data architectures: a multi-agent systems perspective

Objectives

1. Understand the differences between traditional computing and cloud computing
2. Learn about different cloud service models and when to choose which one
3. Understand the sources of Big Data
4. Understand the different types of Big Data tools
5. Understand performance analysis and data analysis with Big Data systems


Takeaways

In this course, I have learned the structures of NoSQL databases, which include key-value stores, document-oriented databases, graph databases, and column-oriented databases. NoSQL databases support large volumes of structured, unstructured, and semi-structured data, and are schemaless, making it easy to integrate data from different sources. The CAP theorem states that a distributed database cannot guarantee more than two of consistency, availability, and partition tolerance at the same time. NoSQL databases therefore target either consistency and partition tolerance (CP) or availability and partition tolerance (AP).
NoSQL databases exhibit the BASE properties (Basically Available, Soft state, Eventual consistency) rather than the ACID properties (Atomicity, Consistency, Isolation, Durability) of traditional relational databases. NoSQL databases organize data in key/value pairs or tuples, and support query languages such as SQL-like dialects, HQL, XQuery, and SPARQL for querying structured data.
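To make the four NoSQL data structures above concrete, here is a small, purely illustrative sketch in Python; it is not tied to any particular database product, and all keys, fields, and records are invented for the example.

```python
# Illustrative only: the four NoSQL data models represented with plain Python
# structures. The keys, fields, and records are made up for the example.

# 1. Key-value store: an opaque value looked up by a unique key.
key_value = {
    "user:1001": b'{"name": "Ada", "plan": "pro"}',
}

# 2. Document store: semi-structured documents with no fixed schema.
documents = [
    {"_id": 1, "name": "Ada", "tags": ["admin"]},
    {"_id": 2, "name": "Lin", "address": {"city": "Oslo"}},  # different fields are fine
]

# 3. Column-oriented (column-family) store: each row groups columns by family.
column_family = {
    "row-1": {"info:name": "Ada", "stats:logins": 42},
    "row-2": {"info:name": "Lin"},
}

# 4. Graph store: nodes plus labeled edges (here as an adjacency list).
graph = {
    "Ada": [("FOLLOWS", "Lin")],
    "Lin": [("FOLLOWS", "Ada"), ("LIKES", "post-7")],
}

print(key_value["user:1001"])
print(documents[1]["address"]["city"])
print(column_family["row-1"]["stats:logins"])
print(graph["Lin"])
```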

The lecture also covers the four layers of the Hadoop ecosystem: the data storage layer (HDFS and HBase), the data processing layer (MapReduce and YARN), the data access layer (Hive, Pig, Mahout, Avro, and Sqoop), and the data management layer (Oozie, Chukwa, Flume, and ZooKeeper). HDFS stores data in a distributed environment and provides fault tolerance and availability through replication. HBase is suited to low-latency applications and stores structured data by partitioning it into small chunks.

HDFS has a master (the NameNode) that manages the file system namespace, monitors the health of the DataNodes, and controls file access by end users. DataNodes store the actual data and are distributed across the cluster. HDFS splits files into blocks of 64 MB (the default in Hadoop 1.x) and maps each block to DataNodes, with every block replicated on three DataNodes by default for reliability. A Secondary NameNode periodically checkpoints the NameNode's metadata by merging the edit log into the file system image (it is not a hot standby), and HDFS federation was introduced to overcome the NameNode's memory limitations by allowing additional independent NameNodes, each managing part of the namespace.
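As a quick back-of-the-envelope sketch of how block splitting and replication add up, assuming the 64 MB block size and replication factor of 3 mentioned above (the file size below is an arbitrary example):

```python
import math

BLOCK_SIZE_MB = 64        # block size used in the lecture (Hadoop 1.x default)
REPLICATION_FACTOR = 3    # default number of copies of each block

def hdfs_storage(file_size_mb: float) -> tuple[int, float]:
    """Return (number of blocks, total raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)      # last block may be smaller
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR    # every byte stored 3 times
    return blocks, raw_storage_mb

# Example: a 1 GB (1024 MB) file.
blocks, raw_mb = hdfs_storage(1024)
print(blocks)   # 16 blocks of 64 MB
print(raw_mb)   # 3072 MB of raw capacity consumed across the cluster
```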

The lecture also covers the steps involved in writing data to HDFS: the client connects to the NameNode, the NameNode creates a new record for the file's metadata and initiates file creation, the input file is split into blocks of 64 MB, and the NameNode identifies target DataNodes for each block based on the number of replicas. HDFS is not suitable for applications that require low latency or for storing a large number of small files, because the NameNode holds all metadata in memory and that memory is limited.
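To make the write path concrete, here is a minimal, purely illustrative simulation of those steps in Python. The DataNode names, cluster size, and round-robin placement are invented for the sketch and do not reflect HDFS's real rack-aware placement policy.

```python
import itertools
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, as in the lecture
REPLICAS = 3                    # number of copies of each block

datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # hypothetical cluster

def plan_write(file_size_bytes: int) -> list[dict]:
    """Simulate the NameNode's part of an HDFS write: split the file into
    blocks and choose REPLICAS DataNodes for each block (naive round-robin)."""
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    rotation = itertools.cycle(datanodes)
    plan = []
    for block_id in range(num_blocks):
        pipeline = [next(rotation) for _ in range(REPLICAS)]
        plan.append({"block": block_id, "pipeline": pipeline})
    return plan

# The client would then stream packets to the first DataNode in each pipeline,
# which forwards them to the second, which forwards them to the third.
for entry in plan_write(200 * 1024 * 1024):        # a 200 MB file -> 4 blocks
    print(entry)
```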

I have also learned about Hadoop, a framework that facilitates the storage and processing of large datasets. Hadoop's storage system, the Hadoop Distributed File System (HDFS), stores data in blocks, which are sent to DataNodes in packets. The write process is carried out in a pipeline fashion, with each DataNode responsible for forwarding each data packet to the next. The HDFS read process involves a client initiating a read request, the distributed file system client requesting block-location metadata, and the NameNode returning the locations of the DataNodes holding copies of each block.

The MapReduce programming model, a batch processing model, is also discussed. It breaks the processing of data into two phases: Map and Reduce. The course goes on to describe the limitations of Hadoop 1.0, including the JobTracker being a single point of failure and Hadoop 1.0's inability to run non-MapReduce applications. Hadoop 2.0 was created to address these issues, and its features, including YARN with its ResourceManager and per-application ApplicationMaster, are described. Finally, Little's Law and its application to computing systems are discussed.
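To make the Map and Reduce phases concrete, the following is a minimal in-memory word-count sketch of the MapReduce model in plain Python (no Hadoop involved); it only mimics the map, shuffle/group, and reduce steps on two made-up input lines.

```python
from collections import defaultdict

def map_phase(line: str):
    """Map: emit (word, 1) for every word in a line of input."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: sum all counts emitted for the same word."""
    return (word, sum(counts))

lines = ["big data needs big storage", "hadoop stores big data"]

# Shuffle/group step: collect all values emitted for the same key.
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))   # [('big', 3), ('data', 2), ('hadoop', 1), ...]
```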
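Little's Law states that L = λW: the average number of items in a system equals the arrival rate multiplied by the average time each item spends in the system. A quick numerical illustration follows; the request rate and latency are made-up numbers.

```python
# Little's Law: L = lambda * W
arrival_rate = 200.0      # hypothetical: 200 requests arriving per second
time_in_system = 0.05     # hypothetical: each request spends 50 ms in the system

avg_in_system = arrival_rate * time_in_system
print(avg_in_system)      # 10.0 requests in the system on average
```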
