Solution :
Step 1 : Create all three files in HDFS in a directory called spark2 (we will do this using Hue). Alternatively, you can first create them in the local filesystem and then upload them to HDFS.
Step 2 : Load the Content.txt file
val content = sc.textFile("spark2/Content.txt") //Load the text file
Step 3 : Load the Remove.txt file
val remove = sc.textFile("spark2/Remove.txt") //Load the text file
Step 4 : Create an RDD of words from remove. Each word may have leading or trailing spaces, so strip that whitespace as well. We use flatMap and map on the RDD, and trim on each word.
val removeRDD = remove.flatMap(x => x.split(",")).map(word => word.trim) //Create an RDD of trimmed words
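To see what the split-and-trim in Step 4 does to a single line, here is a minimal sketch on a plain Scala string, with no Spark involved. The object name RemoveListDemo, the helper parseRemoveLine and the sample line are hypothetical and only for illustration; the actual contents of Remove.txt are not shown in this solution.

```scala
object RemoveListDemo {
  // Split one comma-separated line and trim the whitespace around each word.
  // This is the same per-line logic that flatMap + map(trim) applies to the RDD.
  def parseRemoveLine(line: String): List[String] =
    line.split(",").map(_.trim).toList

  def main(args: Array[String]): Unit = {
    // Hypothetical sample line with stray spaces around the words.
    val sample = "a, the ,an"
    println(parseRemoveLine(sample)) // List(a, the, an)
  }
}
```

Without the trim, " the " would keep its spaces and would never match the word "the" when filtering the content later.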
Step 5 : Broadcast the list of words that you want to ignore
val bRemove = sc.broadcast(removeRDD.collect().toList) // The broadcast value is a List of Strings
Step 6 : Split the content RDD into individual words, so we have an RDD of Strings.
val words = content.flatMap(line => line.split(" "))
Step 7 : Filter the RDD, so it keeps only the words that are not present in the broadcast variable.
val filtered = words.filter(word => !bRemove.value.contains(word))
Step 8 : Create a PairRDD, so we have a (word,1) tuple for each word.
val pairRDD = filtered.map(word => (word,1))
Step 9 : Now do the word count on the PairRDD.
val wordCount = pairRDD.reduceByKey(_ + _)
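The logic of Steps 6 to 9 can be checked on a small local collection before running it on HDFS data. The sketch below is an assumption-laden stand-in, not the Spark job itself: it uses plain Scala collections, replaces reduceByKey with groupBy plus sum, and uses a Set for the remove list (a possible alternative to the List above, since lookups in a Set do not scan every element). The names WordCountDemo and countWords and the sample data are hypothetical.

```scala
object WordCountDemo {
  // Local-collection version of steps 6 to 9: split, filter, pair, sum per word.
  def countWords(lines: Seq[String], remove: Set[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                  // step 6: lines to words
      .filter(word => !remove.contains(word)) // step 7: drop ignored words
      .map(word => (word, 1))                 // step 8: (word, 1) pairs
      .groupBy(_._1)                          // step 9: group pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the 1s

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data standing in for Content.txt and Remove.txt.
    val lines = Seq("spark is fast", "spark is a framework")
    val remove = Set("is", "a")
    countWords(lines, remove).foreach(println)
  }
}
```

With this sample input, "spark" is counted twice and "fast" and "framework" once each, while "is" and "a" are dropped. Note that splitting on a single space produces empty strings when the input contains consecutive spaces, so real data may need an extra filter for empty words.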
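Step 10 : Save the output as a text file. Note that Spark writes a directory with the given name containing one part file per partition, not a single file.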
wordCount.saveAsTextFile("spark2/result.txt")