13 MapReduce
The MapReduce workflow is a programming model used for processing large data sets in parallel, typically in a distributed computing environment. It is widely used in big data processing frameworks like Hadoop and Spark. The concept revolves around two main operations:
- Map: Apply a function to each item in an input data set to produce a set of intermediate key-value pairs.
- Reduce: Combine all the intermediate values associated with the same key to produce the final result.
13.1 MapReduce Workflow in Python
In Python, you can implement a MapReduce-style workflow using the built-in map() function and reduce() from the functools module. While the traditional MapReduce model is meant for distributed systems, the same concept can be used in Python for processing data in a more manageable and parallelizable manner.
13.1.1 Explanation of Map and Reduce:
Map:
- The map() function applies a given function to each item in an input iterable (e.g., list, tuple) and returns an iterator with the results.
- This is equivalent to the “mapping” operation in the MapReduce model, where you transform input data into intermediate data.
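As a quick standalone sketch of map() on its own (separate from the word-count example below):
# map() applies the lambda to each element; list() forces evaluation of the lazy iterator
doubled = list(map(lambda x: 2 * x, [1, 2, 3]))
doubled
[2, 4, 6]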
Reduce:
- The reduce() function from the functools module accumulates a result by applying a binary function (a function that takes two arguments) to elements of an iterable, combining them into a single output value.
- This is equivalent to the “reducing” operation in the MapReduce model, where you aggregate or combine intermediate results to produce the final output.
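A matching standalone sketch of reduce():
from functools import reduce

# reduce() folds the iterable left to right; 0 is the initial accumulator value
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
total
10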
13.1.2 Example: Word Count using MapReduce in Python
A classic example of MapReduce is the Word Count problem, where you want to count the frequency of each word in a list of sentences.
First, import reduce and define the sample data:
from functools import reduce
# Sample data: List of sentences
data = [
    "hello world",
    "hello",
    "hello map reduce",
    "map reduce example"
]
13.1.2.1 Step 1: Map
- Create a list of (word, 1) pairs
def map_words(sentence):
    words = sentence.split()
    return [(word, 1) for word in words]

# Applying the map function to all sentences and converting the map object to a list
mapped = list(map(map_words, data))
mapped
[[('hello', 1), ('world', 1)],
[('hello', 1)],
[('hello', 1), ('map', 1), ('reduce', 1)],
[('map', 1), ('reduce', 1), ('example', 1)]]
# Flatten the list of lists into a single list of (word, 1) pairs
flattened = [pair for sublist in mapped for pair in sublist]
flattened
[('hello', 1),
('world', 1),
('hello', 1),
('hello', 1),
('map', 1),
('reduce', 1),
('map', 1),
('reduce', 1),
('example', 1)]
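As an aside (not part of the original example), the same flattening can also be written with the standard itertools module, which avoids the nested comprehension:
from itertools import chain

# chain.from_iterable lazily concatenates the inner lists of (word, 1) pairs
flattened = list(chain.from_iterable(mapped))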
13.1.2.2 Step 2: Reduce phase
- Sum up the values for each unique word
# First, group by words using a dictionary
= {}
word_count_dict for word, count in flattened:
if word in word_count_dict:
+= count
word_count_dict[word] else:
= count
word_count_dict[word]
word_count_dict
{'hello': 3, 'world': 1, 'map': 2, 'reduce': 2, 'example': 1}
Or, equivalently, using reduce():
def reduce_word_counts(acc, pair):
    word, count = pair
    if word in acc:
        acc[word] += count
    else:
        acc[word] = count
    return acc

word_count_dict = reduce(reduce_word_counts, flattened, {})
word_count_dict
{'hello': 3, 'world': 1, 'map': 2, 'reduce': 2, 'example': 1}
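For comparison (an aside, not part of the MapReduce pattern itself), the standard collections.Counter performs the same aggregation in a single call:
from collections import Counter

# Counter tallies hashable items; counting the words directly makes the explicit 1 values unnecessary
word_count_dict = dict(Counter(word for word, _ in flattened))
word_count_dict
{'hello': 3, 'world': 1, 'map': 2, 'reduce': 2, 'example': 1}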
13.1.2.3 Explanation:
Map Phase:
- The function map_words(sentence) takes a sentence, splits it into words, and returns a list of (word, 1) pairs.
- map(map_words, data) applies this function to each sentence in data, resulting in a list of lists of (word, 1) pairs.
Flattening:
- The nested list produced by map() is flattened into a single list of (word, 1) pairs using a list comprehension: [pair for sublist in mapped for pair in sublist].
Reduce Phase:
- The word count is accumulated in a dictionary, word_count_dict. For each (word, 1) pair, the count for that word is increased in the dictionary.
- Alternatively, you can use the reduce() function to perform this operation, as shown in the second version above.
13.1.2.4 Summary
- Map: Transforms the data into key-value pairs.
- Reduce: Aggregates or combines the key-value pairs to produce the final result.
13.1.2.5 Why Use MapReduce?
It is ideal for parallel processing large datasets by breaking down the work into smaller, manageable chunks.
The map() and reduce() functions in Python allow you to mimic this process on smaller data sets or locally, without needing a distributed environment.
While this example uses a simple in-memory approach, the MapReduce framework is typically used in distributed computing for large-scale data processing tasks.
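To hint at the parallel side locally, here is a minimal sketch using the standard multiprocessing module (an illustration, not a full MapReduce framework): the map phase is distributed across worker processes, and the reduce phase runs as before.
from multiprocessing import Pool
from functools import reduce

def map_words(sentence):
    return [(word, 1) for word in sentence.split()]

def reduce_word_counts(acc, pair):
    word, count = pair
    acc[word] = acc.get(word, 0) + count
    return acc

if __name__ == "__main__":
    data = ["hello world", "hello", "hello map reduce", "map reduce example"]
    # Pool.map splits data across worker processes and runs map_words in parallel
    with Pool(processes=2) as pool:
        mapped = pool.map(map_words, data)
    flattened = [pair for sublist in mapped for pair in sublist]
    print(reduce(reduce_word_counts, flattened, {}))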