Hadoop and MapReduce MCQs

Hadoop
What is Hadoop primarily used for?
a) Data visualization
b) Large-scale data processing
c) Database management
d) Network security
Answer: b) Large-scale data processing

Which component of Hadoop is responsible for distributed storage?
a) MapReduce
b) YARN
c) HDFS
d) Hive
Answer: c) HDFS

What does HDFS stand for?
a) Hadoop Distributed File System
b) Hadoop Data File Storage
c) Highly Distributed File System
d) Hadoop Data File System
Answer: a) Hadoop Distributed File System

Which component of Hadoop is responsible for resource management and job scheduling?
a) HDFS
b) YARN
c) MapReduce
d) Hive
Answer: b) YARN

How does Hadoop achieve fault tolerance?
a) By replicating data across multiple nodes
b) By using a single powerful server
c) By storing data in memory
d) By using a centralized database
Answer: a) By replicating data across multiple nodes
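
For illustration, a minimal Java sketch (not part of the quiz) that asks HDFS to keep three replicas of a file; the file path and replication factor are assumed examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // Ask the NameNode to keep 3 copies of this file's blocks across DataNodes,
        // so the data survives the loss of individual nodes.
        fs.setReplication(new Path("/data/example.txt"), (short) 3);   // illustrative path
        fs.close();
    }
}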

MapReduce
What is the primary purpose of the MapReduce programming model?
a) Data storage
b) Data visualization
c) Parallel data processing
d) Network communication
Answer: c) Parallel data processing

In MapReduce, what is the role of the “Map” function?
a) To sort the data
b) To process input data and produce intermediate key-value pairs
c) To combine the results
d) To store the output data
Answer: b) To process input data and produce intermediate key-value pairs
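
As an illustration, a minimal word-count Mapper sketch using the org.apache.hadoop.mapreduce API; the class and field names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit an intermediate (word, 1) key-value pair for every token in the input line.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}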

What does the “Reduce” function do in the MapReduce framework?
a) It processes intermediate key-value pairs and produces the final output
b) It sorts the data
c) It splits the data into chunks
d) It combines multiple files into one
Answer: a) It processes intermediate key-value pairs and produces the final output
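
A matching Reducer sketch that sums the intermediate counts emitted by the mapper above and writes the final output.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all intermediate counts for this word and emit the final (word, total) pair.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}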

In a MapReduce job, where is the intermediate data stored between the Map and Reduce phases?
a) HDFS
b) Local file system of the mapper nodes
c) Memory of the reducer nodes
d) YARN resource manager
Answer: b) Local file system of the mapper nodes

What is the role of the Combiner function in MapReduce?
a) To combine the output of multiple reducers
b) To perform local aggregation of intermediate results before passing them to the reducer
c) To split the input data
d) To sort the final output
Answer: b) To perform local aggregation of intermediate results before passing them to the reducer
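
A minimal driver sketch showing where the Combiner is plugged in; here the word-count reducer from above doubles as the combiner because summing counts is associative and commutative, which reduces shuffle traffic.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Local aggregation on each mapper node before data is shuffled to reducers.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}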

Hadoop Ecosystem
Which Hadoop ecosystem component is used for querying and managing large datasets residing in distributed storage using SQL?
a) Pig
b) Hive
c) HBase
d) Sqoop
Answer: b) Hive
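
For illustration, a hedged Java sketch that runs a HiveQL query against HiveServer2 over JDBC; the host, database, table, and column names are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // Hive JDBC driver
        // HiveServer2 endpoint and credentials are illustrative assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}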

What is Apache Pig used for in the Hadoop ecosystem?
a) High-level scripting for data analysis
b) SQL-based querying
c) Real-time data processing
d) Data visualization
Answer: a) High-level scripting for data analysis

Which component of the Hadoop ecosystem provides a NoSQL database that runs on top of HDFS?
a) Pig
b) Hive
c) HBase
d) Sqoop
Answer: c) HBase
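
A minimal HBase client sketch that writes and reads back one cell; the table name, column family, and row key are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}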

What is the primary function of Apache Sqoop?
a) To move bulk data between Hadoop and structured datastores
b) To perform real-time processing
c) To visualize data
d) To provide a distributed file system
Answer: a) To move bulk data between Hadoop and structured datastores

Which component of the Hadoop ecosystem provides a distributed publish-subscribe platform for real-time data streams?
a) Flume
b) Oozie
c) Spark
d) Kafka
Answer: d) Kafka
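
A minimal Kafka producer sketch in Java; the broker address and topic name are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is appended to a partition of the "events" topic and can be
            // consumed in near real time by downstream stream-processing consumers.
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        }
    }
}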

General Hadoop and MapReduce
What is the default block size in HDFS (Hadoop 2.x and later)?
a) 32 MB
b) 64 MB
c) 128 MB
d) 256 MB
Answer: c) 128 MB
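
A small sketch that reads the configured and per-file block sizes through the Hadoop Java API; the file path is an assumed example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // dfs.blocksize defaults to 128 MB (134217728 bytes) in Hadoop 2.x and later.
        System.out.println("Configured block size: " + conf.getLong("dfs.blocksize", 134217728L));
        // The block size actually used by an existing file (illustrative path).
        FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
        System.out.println("File block size: " + status.getBlockSize());
    }
}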

How does Hadoop ensure data integrity in HDFS?
a) By using checksums
b) By storing multiple copies of the same data
c) By using encryption
d) By storing data in memory
Answer: a) By using checksums
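
A small sketch that retrieves a file-level checksum through the FileSystem API; HDFS also verifies per-block checksums transparently on every read. The path is an assumed example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt");   // illustrative path
        // Exposes the file-level checksum, e.g. to compare two copies of the same file.
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println(checksum);
        fs.close();
    }
}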

Which component in YARN is responsible for tracking the status and progress of an individual application?
a) ResourceManager
b) NodeManager
c) ApplicationMaster
d) JobTracker
Answer: c) ApplicationMaster

What is the role of the ResourceManager in YARN?
a) To manage the global assignment of resources to applications
b) To execute MapReduce jobs
c) To store the input data
d) To manage data replication
Answer: a) To manage the global assignment of resources to applications

Which of the following is a benefit of using Hadoop for data mining?
a) Scalability
b) High cost
c) Centralized data storage
d) Limited fault tolerance
Answer: a) Scalability