Video: Apache Spark Full Course – Learn Apache Spark in 8 Hours | Apache Spark Tutorial | Edureka
For the past five years, Spark has been on an absolute tear, becoming one of the most widely used technologies in big data and AI. Today's cutting-edge companies like Facebook, Apple, Netflix, Uber, and many more have deployed Spark at massive scale, processing petabytes of data to deliver innovations ranging from detecting fraudulent behavior to delivering personalized experiences in real time, and many such innovations are transforming every industry. Hi all, I welcome you to this full course session on Apache Spark: a complete crash course consisting of everything you need to know to get started with Apache Spark from scratch. But before we get into the details, let's look at our agenda for today for better understanding and ease of learning.
The entire crash course is divided into 12 modules. In the first module, Introduction to Spark, we'll try to understand what exactly Spark is and how it performs real-time processing. In the second module, we'll dive deep into the different components that constitute Spark; we'll also learn about the Spark architecture and its ecosystem. Next up, in the third module, we will learn what exactly resilient distributed datasets (RDDs) are in Spark. The fourth module is all about DataFrames: in this module, we will learn what exactly DataFrames are and how to perform different operations on them. Moving on, in the fifth module, we will discuss the different ways that Spark provides to perform SQL queries for accessing and processing data. In the sixth module, we will learn how to perform streaming on live data streams using Spark Streaming, and in the seventh module we'll discuss how to execute different machine learning algorithms using the Spark machine learning library. The eighth module is all about Spark GraphX.
In this module, we are going to learn what graph processing is and how to perform graph processing using the Spark GraphX library. In the ninth module, we'll discuss the key differences between two popular data processing paradigms, MapReduce and Spark. Talking about the tenth module, we will integrate two popular frameworks, Spark and Kafka. The eleventh module is all about PySpark: in this module, we'll try to understand how PySpark exposes the Spark programming model to Python. Lastly, in the twelfth module, we'll take a look at the most frequently asked interview questions on Spark, which will help you ace your interview with flying colors. Thank you, guys; while you are at it, please do not forget to subscribe to the Edureka YouTube channel to stay updated with trending technologies. There has been a buzz around the world that Spark is the future of the big data platform, which is a hundred times faster than MapReduce and is also a go-to tool for all solutions.
But what exactly is Apache Spark, and why is it so popular? In this session, I will give you a complete insight into Apache Spark and its fundamentals. Without any further ado, let's quickly look at the topics to be covered in this session. First and foremost, I will tell you what Apache Spark is and its features. Next, I will take you through the components of the Spark ecosystem that make Spark the future of the big data platform. After that, I will talk about the fundamental data structure of Spark, that is, the RDD; I will also tell you about its features, its operations, the ways to create RDDs, etc. At the last, I'll wrap up the session by giving a real-time use case of Spark. So let's get started with the very first topic and understand what Spark is. Spark is an open-source, scalable, massively parallel, in-memory execution environment for running analytics applications. You can just think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across the cluster. When it comes to big data processing, much like MapReduce, Spark works to distribute the data across the cluster and then process that data in parallel.
The difference here is that unlike MapReduce, which shuffles the files around on disk, Spark works in memory, and that makes it much faster at processing the data than MapReduce. It is also said to be the lightning-fast unified analytics engine for big data and machine learning. So now let's look at the interesting features of Apache Spark. Coming to speed, you can call Spark a swift processing framework, because it is a hundred times faster in memory and ten times faster on disk when comparing it with Hadoop; not only that, it also provides high data processing speed. Next, powerful caching: it has a simple programming layer that provides powerful caching and disk persistence capabilities, and Spark can be deployed through Mesos, Hadoop YARN, or Spark's own cluster manager. As you all know, Spark itself was designed and developed for real-time data processing, so it's an obvious fact that it offers real-time computation and low latency because of in-memory computation. Next, polyglot: Spark provides high-level APIs in Java, Scala, Python, and R, and Spark code can be written in any of these four languages. Not only that, it also provides a shell in Scala and Python.
These are the various features of Spark. Now let's see the various components of the Spark ecosystem. Let me first tell you about the Spark Core component: it is the most vital component of the Spark ecosystem, which is responsible for basic I/O functions, scheduling, monitoring, etc. The entire Apache Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python, R, and Java. As I have already mentioned, Spark can be deployed through Mesos, Hadoop YARN, or Spark's own cluster manager. The Spark ecosystem library is composed of various components like Spark SQL, Spark Streaming, and the machine learning library. Now let me explain each of them. The Spark SQL component is used to leverage the power of declarative queries and optimized storage by executing SQL queries on Spark data, which is present in RDDs and other external sources. Next, the Spark Streaming component allows developers to perform batch processing and streaming of data in the same application. Coming to the machine learning library, it eases the deployment and development of scalable machine learning pipelines, like summary statistics, correlations, feature extraction, transformation functions, optimization algorithms, etc. And the GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation. Now, talking about the programming languages: Spark supports Scala, a functional programming language in which Spark is written, so Spark supports Scala as an interface. Then Spark also supports a Python interface: you can write a program in Python and execute it over Spark; again, if you see the code in Python and Scala, both are very similar. Then R is very famous for data analysis and machine learning.
So Spark has also added support for R, and it also supports Java, so you can go ahead and write the code in Java and execute it over Spark. Next, the data can be stored in HDFS, the local file system, or Amazon S3 cloud storage, and Spark also supports SQL and NoSQL databases as well. So this is all about the various components of the Spark ecosystem. Now, let's see what's next. When it comes to iterative distributed computing, that is, processing the data over multiple jobs and computations, we need to reuse or share the data among multiple jobs. In earlier frameworks like Hadoop, there were problems while dealing with multiple operations or jobs. Here we need to store the data in some intermediate, stable, distributed storage such as HDFS, and the multiple I/O operations make the overall computation of jobs much slower; there were also replication and serialization, which in turn made the process even slower. Our goal here was to reduce the number of I/O operations to HDFS, and this can be achieved only through in-memory data sharing. In-memory data sharing is 10 to 100 times faster than network and disk sharing, and RDDs try to solve all these problems by enabling fault-tolerant, distributed, in-memory computation.
So now let's understand what RDDs are. RDD stands for Resilient Distributed Dataset. They are considered to be the backbone of Spark and are one of the fundamental data structures of Spark. They are also known as schema-less structures that can handle both structured and unstructured data. In Spark, anything you do is around RDDs: when you're reading data in Spark, it is read into an RDD; when you're transforming the data, you're performing transformations on an old RDD and creating a new one; then at last you will perform some actions on the RDD and store the data present in the RDD to persistent storage. A resilient distributed dataset is an immutable distributed collection of objects. Your objects can be anything like strings, lines, rows, objects, collections, etc. RDDs can contain any type of Python, Java, or Scala objects,
even including user-defined classes. Now, talking about the distributed environment: each dataset present in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Due to this, you can perform transformations or actions on the complete data in parallel, and you don't have to worry about the distribution, because Spark takes care of that. RDDs are highly resilient, that is, they are able to recover quickly from any issues, as the same data chunks are replicated across multiple executor nodes; so even if one executor fails, another will still process the data. This allows you to perform functional calculations against a dataset very quickly by harnessing the power of multiple nodes. So this is all about RDDs. Now let's have a look at some of the important features of RDDs. RDDs have a provision of in-memory computation, and all transformations are lazy; that is, they do not compute the results right away until an action is applied. So it supports in-memory computation and lazy evaluation as well. Next:
fault tolerance. In the case of RDDs, they track the data lineage information to rebuild lost data automatically, and this is how they provide fault tolerance to the system. Next, immutability: data can be created or retrieved anytime, and once defined, its value cannot be changed; that is the reason why I said RDDs are immutable. Next, partitioning: it is the fundamental unit of parallelism in Spark, and all the data chunks in an RDD are divided into partitions. Next, persistence: users can reuse RDDs and choose a storage strategy for them. Next, coarse-grained operations: these apply to all elements in datasets through map, filter, or group-by operations. So these are the various features of RDDs. Now let's see the ways to create RDDs. There are three ways: one can create an RDD from parallelized collections, one can create an RDD from an existing RDD or other RDDs, and it can also be created from external data sources like HDFS, Amazon S3, HBase, etc.
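To make the lineage idea above concrete, here is a minimal pure-Python sketch, not the Spark API: a lost partition is rebuilt by replaying its recorded chain of transformations over the original source data. The names `source`, `lineage`, and `recompute` are illustrative only.

```python
# Hypothetical sketch of lineage-based recovery: a lost partition is
# re-derived from its source plus the recorded chain of transformations,
# instead of being restored from a replica.
source = list(range(1, 11))                    # original input partition
lineage = [lambda x: x * 2, lambda x: x + 1]   # recorded transformations

def recompute(data, transformations):
    # Replay each recorded transformation, in order, to rebuild the data.
    for fn in transformations:
        data = [fn(x) for x in data]
    return data

recovered = recompute(source, lineage)
print(recovered)  # [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
```

The point of the sketch is that only the small recipe (the lineage) has to survive a failure, not a full copy of the derived data.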
Now let me show you how to create RDDs. I'll open my terminal and first check whether my daemons are running or not. Cool, here I can see that the Hadoop and Spark daemons are both running. So now, first, let's start the Spark shell; it will take a bit of time to start. Cool, now the Spark shell has started; I can see the version of Spark is 2.1.1, and we have a Scala shell over here. Now I will tell you how to create RDDs in three different ways using the Scala language. First, let's see how to create an RDD from parallelized collections. sc.parallelize is the method that I use to create a parallelized collection as an RDD; this uses the SparkContext's parallelize method. So I will give sc.parallelize, and here I will parallelize the numbers 1 to 100 in five different partitions, and I will apply collect as the action to start the process.
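As a rough mental model of what `sc.parallelize(1 to 100, 5)` does, here is a small pure-Python sketch that chunks the data into five contiguous pieces. This is illustrative only; it is not Spark's actual partitioner, and the `partition` function is a name I made up for the sketch.

```python
def partition(data, num_partitions):
    # Split data into num_partitions roughly equal, contiguous chunks,
    # the way parallelizing a local collection conceptually distributes it.
    size = (len(data) + num_partitions - 1) // num_partitions
    return [data[i * size:(i + 1) * size] for i in range(num_partitions)]

parts = partition(list(range(1, 101)), 5)
print(len(parts))        # 5
print(parts[0][:3])      # [1, 2, 3]
print(parts[-1][-1])     # 100
```

Each chunk here corresponds to one partition, which is why the web UI in the demo shows five tasks running in parallel for this one job.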
So here in the result, you can see an array of the numbers 1 to 100. Okay, now let me show you how the partitions appear in the web UI of Spark. The web UI port for Spark is localhost:4040. Here you have just completed one task, that is, sc.parallelize with collect. Here you can see all five stages that succeeded, because we have divided the task into five partitions. So let me show you the partitions. This is the DAG visualization, that is, the directed acyclic graph visualization, wherein you have applied only parallelize as a method, so you can see only one stage here. Here you can see the RDD that has been created, and coming to the event timeline, you can see the task that has been executed in five different stages, and the different colors imply
the scheduler delay, task deserialization time, shuffle read time, shuffle write time, executor computing time, etc. Here you can see the summary metrics for the created RDD; you can see that the maximum time it took to execute the tasks in five partitions in parallel is just 45 milliseconds. You can also see the executor ID, the host ID, the status (that is, succeeded), duration, launch time, etc. So this is one way of creating an RDD, from parallelized collections. Now let me show you how to create an RDD from an existing RDD. Okay, here I'll create an array called a1 and assign the numbers one to ten: one, two, three, four, five, six, seven... Okay, so I got the result here: I have created an integer array of 1 to 10, and now I will parallelize this array a1. Sorry, I got an error; it is sc.parallelize of a1.
Okay, so I created an RDD called ParallelCollectionRDD. Cool. Now I will create a new RDD from the existing RDD, that is, val newRDD = a1.map: from the data present in an RDD, I will create a new RDD. So here I will take a1 as a reference, map the data, and multiply each element by two. So what should our output be if I map the data present in the RDD and multiply it by two? It would be 2, 4, 6, 8, up to 20, correct? So let's see how it works. Yes, we got the output, the doubles of 1 to 10, that is, 2, 4, 6, 8, up to 20. So this is one method of creating a new RDD from an old RDD. And I have one more method, that is, from external file sources. So what I will do here is give val test = sc.textFile, and here I will give the path to the HDFS file location, that is, hdfs://localhost:9000 is the path, and I have a folder called example, and in that I have a file called sample.
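The a1.map step shown a moment ago can be mimicked in plain Python to highlight the immutability point made earlier: mapping builds a brand-new collection, and the original is left untouched. This is a concept sketch only, not the Spark API.

```python
a1 = list(range(1, 11))           # like the parallelized array of 1 to 10
new_rdd = [x * 2 for x in a1]     # like a1.map(_ * 2): produces a new dataset

print(new_rdd)  # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(a1)       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  (original is unchanged)
```

In the same spirit, each Spark transformation yields a new RDD while the parent RDD stays exactly as it was.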
Cool, so I got one more RDD created here. Now let me show you this file that I have already kept in the HDFS directory. I will browse the file system and show you the /example directory that I have created. So here you can see the example directory that I have created, and here I have sample as the input file that I have given. So here you can see the same path location. This is how I can create an RDD from external file sources; in this case, I have used HDFS as an external file source. So this is how we can create RDDs in three different ways: from parallelized collections, from external data sources, and from existing RDDs. So let's move further and see the various operations that RDDs support. RDDs actually support two main operations, namely transformations and actions. As I have already said, RDDs are immutable, so once you create an RDD, you cannot change any content in the RDD; so you might be wondering how an RDD applies those transformations, correct? When you run any transformations, it runs those transformations on the old RDD and creates a new RDD. This is basically done for optimization reasons. Transformations are the operations which are applied on an RDD to create a new RDD. Now, these transformations work on the principle of lazy evaluation.
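Lazy evaluation can be illustrated with a Python generator: nothing is computed when the "transformation" is declared, and the work happens only when an "action" consumes the result. This is an analogy sketch of the behavior, not how Spark is implemented internally.

```python
executed = []

def lazy_double(data):
    # The generator body runs only when consumed, like a lazy transformation.
    for x in data:
        executed.append(x)
        yield x * 2

pipeline = lazy_double([1, 2, 3])   # "transformation": nothing has run yet
print(executed)                     # []  (no work done so far)
result = list(pipeline)             # "action": forces the computation
print(result, executed)             # [2, 4, 6] [1, 2, 3]
```

This is exactly the behavior described above: the recorded operation sits idle until an action such as collect or count pulls the data through it.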
So what does it mean? It means that when we call some operation on an RDD, it does not execute immediately; Spark maintains a record of the operation that is being called. Since transformations are lazy in nature, we can execute the operation at any time by calling an action on the data. Hence, in lazy evaluation, data is not loaded until it is necessary. Now, these actions analyze the RDD and produce a result. A simple action can be count, which will count the rows in an RDD and then produce a result. So I can say that transformations produce new RDDs and actions produce results. Before moving further with the discussion, let me tell you about the three different workloads that Spark caters to: batch mode, interactive mode, and streaming mode. In the case of batch mode, we run a batch job: you write a job and then schedule it; it works through a queue or batch of separate jobs without manual intervention. Then, in the case of interactive mode, it is an interactive shell where you go and execute the commands one by one: you will execute one command, check the result, and then execute another command based on the output result, and so on. It works similar to the SQL shell. So the shell is the one which executes the driver program, and in the shell mode
you can run it on the cluster. It is generally used for development work, or it is used for ad hoc queries. Then comes the streaming mode, where the program is continuously running: as and when data comes in, it takes the data, performs some transformations and actions on the data, and gets some results. So these are the three different workloads that Spark caters to. Now let's see a real-time use case; here I'm considering Yahoo! as an example. So what are the problems of Yahoo!? Yahoo!'s properties are highly personalized to maximize relevance. The algorithms used to provide personalization, that is, targeted advertisements and personalized content, are highly sophisticated, and the relevance models must be updated frequently, because stories, news feeds, and ads change over time. Yahoo! has over 150 petabytes of data stored on a 35,000-node Hadoop cluster, which should be accessed efficiently to avoid latency caused by data movement and to gain insights from the data in a cost-effective manner. To overcome these problems, Yahoo! looked to Spark to improve the performance of its iterative model training. Here, the machine learning algorithm for news personalization required 15,000 lines of C++ code; on the other hand, the same machine learning algorithm in Spark took just 120 lines of Scala code.
So that is the advantage of Spark, and this algorithm was ready for production use in just 30 minutes of training on a hundred million datasets. Spark's rich APIs are available in several programming languages, it has resilient in-memory storage options, and it is compatible with Hadoop through YARN and the Spark-on-YARN project. Yahoo! uses Apache Spark for personalizing its news web pages and for targeted advertising. Not only that, it also uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and also for categorizing the news stories to find out what kind of users would be interested in reading each category of news. Spark runs over Hadoop YARN to use existing data and clusters; the extensive API of Spark and its machine learning library eases the development of machine learning algorithms; and Spark reduces the latency of model training via in-memory RDDs. So this is how Spark has helped Yahoo! improve performance and achieve its targets. I hope you understood the concept of Spark and its fundamentals.
Now, let me just give you an overview of the Spark architecture. Apache Spark has a well-defined layered architecture where all the components and layers are loosely coupled and integrated with various extensions and libraries. This architecture is based on two main abstractions: first, resilient distributed datasets, that is, RDDs; and second, the directed acyclic graph, called the DAG. In order to understand the Spark architecture, you first need to know the components of Spark, that is, the Spark ecosystem, and its fundamental data structure, the RDD. So let's start by understanding the Spark ecosystem. As you can see from the diagram, the Spark ecosystem is composed of various components like Spark SQL, Spark Streaming, the machine learning library, GraphX, SparkR, and the Core API component. Talking about Spark SQL, it is used to leverage the power of declarative queries and optimized storage by executing SQL queries on Spark data, which is present in RDDs
and other external sources. Next, the Spark Streaming component allows developers to perform batch processing and streaming of the data in the same application. Coming to the machine learning library, it eases the development and deployment of scalable machine learning pipelines, like summary statistics, cluster analysis methods, correlations, dimensionality reduction techniques, feature extraction, and many more. Now, the GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation. Coming to SparkR, it is an R package that provides a lightweight frontend to use Apache Spark. It provides a distributed DataFrame implementation that supports operations like selection, filtering, and aggregation on large datasets. It also supports distributed machine learning using the machine learning library.
Finally, the Spark Core component: it is the most vital component of the Spark ecosystem, which is responsible for basic I/O functions, scheduling, and monitoring. The entire Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python, R, and Java. Now let me tell you about the programming languages. First, Spark supports Scala: Scala is a functional programming language in which Spark is written, and Spark supports Scala as an interface. Then Spark also supports a Python interface: you can write a program in Python and execute it over Spark; again, if you see the code in Scala and Python, both are very similar. Then coming to R, it is very famous for data analysis and machine learning, so Spark has also added support for R. It also supports Java, so you can go ahead and write Java code and execute it over Spark. Spark also provides you an interactive shell for Scala, Python, and R, where you can go ahead and execute the commands one by one. So this is all about the Spark ecosystem.
Next, let's discuss the fundamental data structure of Spark, that is, the RDD, called the resilient distributed dataset. So in Spark, anything you do is around RDDs: you're reading data in Spark, and it is read into an RDD; when you're transforming the data, you're performing transformations on an old RDD and creating a new one; then at the last you will perform some actions on the data and store the dataset present in an RDD to persistent storage. A resilient distributed dataset is an immutable distributed collection of objects; your objects can be anything like strings, lines, rows, objects, collections, etc. Now, talking about the distributed environment: each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Due to this, you can perform transformations and actions on the complete data in parallel, and you don't have to worry about the distribution, because Spark takes care of that. Next, as I said, RDDs are immutable: once you create an RDD, you cannot change any content in the RDD. So you might be wondering how an RDD applies those transformations, correct? When you run any transformations, Spark runs those transformations on the old RDD and creates a new RDD.
This is basically done for optimization reasons. So let me tell you one thing here: RDDs can be cached and persisted. If you want to save an RDD for future work, you can cache it, and it will improve Spark's performance. An RDD is a fault-tolerant collection of elements that can be operated on in parallel; if an RDD is lost, it will automatically be recomputed by using the original transformations. This is how Spark provides fault tolerance. There are two ways to create RDDs: first, by parallelizing an existing collection in your driver program, and second, by referencing a dataset in an external storage system such as a shared file system, HDFS, HBase, etc. Now, transformations are the operations that you perform on an RDD, which will create a new RDD; for example, you can perform filter on an RDD and create a new RDD.
Then there are actions, which analyze the RDD and produce a result; a simple action can be count, which will count the rows in an RDD and produce a result. So I can say that transformations produce new RDDs and actions produce results. So this is all about the fundamental data structure of Spark, that is, the RDD. Now let's dive into the core topic of today's discussion, that is, the Spark architecture. So this is the Spark architecture: in your master node, you have the driver program, which drives your application. The code that you're writing behaves as a driver program, or if you are using the interactive shell, the shell acts as the driver program. Inside the driver program, the first thing that you do is create a SparkContext. Assume that the SparkContext is a gateway to all Spark functionality; it is similar to your database connection.
So any command you execute in your database goes through the database connection; similarly, anything you do on Spark goes through the SparkContext. Now this SparkContext works with the cluster manager to manage various jobs. The driver program and the SparkContext take care of executing the job across the cluster: a job is split into tasks, and these tasks are distributed over the worker nodes. So anytime you create an RDD in the SparkContext, that RDD can be distributed across various nodes and can be cached there; so RDDs are said to be partitioned and distributed across various nodes. Now, worker nodes are the slave nodes, whose job is to basically execute the tasks. The tasks are then executed on the partitioned RDDs in the worker nodes, and the results are returned back to the SparkContext. The SparkContext takes the job, breaks the job into tasks, and distributes them to the worker nodes; these tasks work on the partitioned RDDs, perform whatever operations you wanted to perform, and then collect the results and give them back to the main SparkContext. If you increase the number of workers, then you can divide jobs into more partitions and execute them in parallel over multiple systems.
This will actually be a lot faster. Also, if you increase the number of workers, it will also increase your memory, and you can cache the jobs so that they can be executed much faster. So this is all about the Spark architecture. Now let me give you an infographic idea of the Spark architecture. It follows a master-slave architecture. Here, the client submits the Spark user application code. When an application code is submitted, the driver implicitly converts the user code that contains transformations and actions into a logical directed acyclic graph called the DAG; at this stage, it also performs optimizations such as pipelining transformations. Then it converts the logical graph, the DAG, into a physical execution plan with many stages. After converting it into a physical execution plan, it creates physical execution units called tasks under each stage; these tasks are then bundled and sent to the cluster. Now the driver talks to the cluster manager and negotiates resources, and the cluster manager launches the needed executors. At this point, the driver will also send the tasks to the executors based on data placement. The executors start to register themselves with the driver, so that the driver will have a complete view of the executors, and the executors now start executing the tasks that are assigned by the driver program. At any point of time when the application is running, the driver program will monitor the set of executors that run, and the driver node also schedules future tasks based on data placement.
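The driver/executor flow described above can be sketched with Python's thread pool standing in for the executors: the "driver" splits a job into per-partition tasks, farms them out to workers, and combines the partial results. This is a simplified analogy of the scheduling idea, not Spark's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def task(partition):
    # Each "executor" processes one partition and returns a partial result.
    return sum(partition)

# "Driver": split the job (summing 1..100) into four partition-level tasks.
partitions = [list(range(i, i + 25)) for i in range(1, 101, 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(task, partitions))

total = sum(partials)  # the driver combines the partial results
print(total)           # 5050
```

Adding more workers here shortens the wall-clock time the same way adding executor nodes lets Spark run more partitions in parallel.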
So this is how the internal working takes place in the Spark architecture. There are three different types of workloads that Spark caters to. First, batch mode: in the case of batch mode, we run a batch job; here you write the job and then schedule it, and it works through a queue or batch of separate jobs without manual intervention. Next, interactive mode: this is an interactive shell where you go and execute the commands one by one; you'll execute one command, check the result, and then execute the other command based on the output result, and so on. It works similar to the SQL shell. So the shell is the one which executes the driver program; it is generally used for development work, or it is also used for ad hoc queries. Then comes the streaming mode, where the program is continuously running: as and when the data comes in, it takes the data, performs some transformations and actions on the data, and then produces output results. So these are the three different types of workloads that Spark actually caters to. Now let's move ahead and see a simple demo: let's understand how to create a Spark application in the Spark shell using Scala.
So let's understand how to create a Spark application in the Spark shell using Scala. Assume that we have a text file in the HDFS directory and we are counting the number of words in that text file. So let's see how to do it. Before I start, let me first check whether all my daemons are running or not. I'll type sudo jps. All my Spark daemons and Hadoop daemons are running: I have Master and Worker as the Spark daemons, and NameNode, DataNode, ResourceManager, NodeManager, everything as the Hadoop daemons. So the first thing that I do here is run the Spark shell; it takes a bit of time to start. In the meanwhile, let me tell you that the web UI port for the Spark shell is localhost:4040. So this is the web UI of Spark; if you click on Jobs, right now we have not executed anything, so there are no details over here. So there you have jobs and stages.
So once you execute the jobs, you'll have the records of the tasks that you have executed here; here you can see the stages of the various jobs and tasks executed. So now let's check whether our Spark shell has started or not. Yes, you have your Spark version as 2.1.1, and you have a Scala shell over here. Before I start the code, let's check the content that is present in the input text file by running this command. I'll write val test = sc.textFile, because I have saved a text file over there, and I'll give the HDFS path location; I've stored my text file in this location, and sample is the name of the text file. So now let me give test.collect so that it collects the data and displays the data that is present in the text file.
So in my text file I have: hadoop, research, analyst, data, science, and, science. So this is my input data. Now let me map the functions and apply the transformations and actions. I'll give var map = sc.textFile and specify my input path location, then apply the flatMap transformation to split the data that is separated by spaces, and then map each word so that it is given as (word, 1). Now this will be executed. Yes. Now let me apply the action for this to start the execution of the task. Let me tell you one thing here: before an action is applied, Spark will not start the execution process. So here I have applied reduceByKey to count the number of words in the text file.
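The flatMap, map, and reduceByKey steps used here can be traced in plain Python on a sample input assumed to match the demo file. This is a concept sketch of the semantics only, not the Spark API.

```python
lines = ["hadoop research analyst", "data science and science"]

# flatMap: split each line on spaces and flatten into one list of words
words = [w for line in lines for w in line.split(" ")]
# map: pair every word with a count of 1, i.e. (word, 1)
pairs = [(w, 1) for w in words]
# reduceByKey: sum the counts for each distinct word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["science"], counts["hadoop"])  # 2 1
```

Note that splitting on spaces is what makes "data" and "science" separate words, which is exactly why the demo output later shows the science count as two.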
So now we are done with applying the transformations and actions. The next step is to specify the output location to store the output file. So I will give counts.saveAsTextFile and then specify the location for the output file; I'll store it in the same location where I have my input file, and I'll specify my output file name as output9. Cool, I forgot to give double quotes, and I will run this. So it's completed now. Now let's see the output. I will open my Hadoop web UI by giving localhost:50070 and browse the file system to check the output. So, as I have said, I have example as the directory that I have created, and in that I have specified output9 as my output.
So I have the two part files created. Let's check each of them one by one. So we have the data count as one, analyst count as one and science count as two. So this is the first part file. Now let me open the second part file for you. So this is the second part file; there you have Hadoop count as one and the research count as one. So now let me show you the text file that we have specified as the input. So as I have told you, Hadoop count is one, research count is one, analyst one, data one, and science and science as one and one. So you might be thinking data science is one word; no, in the program code we have asked to count the words that are separated by a space. So that is why we have science count as two. I hope you got an idea about how word count works. Similarly, I will now parallelize the numbers 1 to 100 and divide the task into five partitions to show you what partitions of tasks are. So I will write sc.parallelize with the numbers 1 to 100, divide them into five partitions and apply the collect action to collect the numbers and start the execution.
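As an aside, the five-way split that sc.parallelize(1 to 100, 5) performs can be imitated locally with grouped(); this is only a sketch of the chunking idea, not actual Spark partitioning.

```scala
// Local sketch: split 1..100 into five equal chunks, the way
// sc.parallelize(1 to 100, 5) would split the data into five partitions.
val numbers    = (1 to 100).toList
val partitions = numbers.grouped(numbers.size / 5).toList // five chunks of 20
```

Each chunk plays the role of one partition; on a cluster the five partitions would be processed in parallel by the executors.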
So it displays you an array of 100 numbers. Now, let me explain to you the job stages, partitions, event timeline, DAG representation and everything. So now let me go to the web UI of Spark and click on jobs. So these are the tasks that I have submitted. So coming to the word count example, this is the DAG visualization. I hope you can see it clearly: first you collected the text file, then you applied the flatMap transformation and mapped it to count the number of words, then applied reduceByKey, and then saved the output file with saveAsTextFile. So this is the entire DAG visualization of the number of steps that we have covered in our program. So here it shows the completed stages, that is two stages, and it also shows the duration, that is 2 seconds. And if you click on the event timeline, it just shows the executor that is added. And in this case you cannot see any partitions because you have not split the jobs into various partitions. So this is how you can see the event timeline and the DAG visualization. Here you can also see the stage ID and description of what you have submitted (I have just submitted it now), and it also shows the duration that it took to execute the task, the output bytes, the shuffle read, shuffle write and many more. Now, to show you the partitions: see, in this you just applied sc.parallelize, right? So it is just showing one stage where you have applied the parallelize operation here.
It shows the succeeded tasks as 5/5. That is, you have divided the task into five parts, and all five have been executed successfully. Now here you can see the partitions of the five different tasks that executed in parallel. So depending on the colors, it shows the scheduler delay, the shuffle read time, executor computing time, result serialization time, getting result time and many more. So you can see that the duration it took to execute the five tasks in parallel at the same time is a maximum of one millisecond. So with in-memory processing, Spark has much faster computation, and you can see the IDs of all the five different tasks; all are a success. You can see the locality level, the executor and the host IP, the launch time, the duration it took, everything. So you can also see that we have created an RDD and parallelized it. Similarly here also, for the word count example, you can see the RDD that has been created and also the operations that were applied to execute the task, and you can see the duration that it took. Even here it's just one millisecond that it took to execute the entire word count example, and you can see the IDs, locality level, executor ID.
So in this case, we have just executed the task in two stages, so it is just showing the two stages. So this is all about how the web UI looks and what are the features and information that you can see in the web UI of Spark after executing the program in the Scala shell. So in this program, you can see that we first gave the path to the input location and checked the data that is present in the input file. And then we applied flatMap transformations, created an RDD, then applied an action to start the execution of the task and saved the output file in this location. So I hope you got a clear idea of how to execute a word count example and check for the various features in the Spark web UI, like partitions and DAG visualizations, and I hope you found the session interesting. Apache Spark: this word can generate a spark in every Hadoop engineer's mind. It is a big data processing framework which is lightning fast, and a cluster computing framework.
And the core reason behind its outstanding performance is the resilient distributed dataset, or in short, the RDD, and today I'll focus on the topic called RDDs using Spark. Before we get started, let's have a quick look at the agenda for today's session. We shall start with understanding the need for RDDs, where we'll learn the reasons behind which the RDDs were required. Then we shall learn what RDDs are, where we'll understand what exactly an RDD is and how they work. Later I'll walk you through the fascinating features of RDDs, such as in-memory computation, partitioning, persistence, fault tolerance and many more. Once I finish the theory, I'll get your hands on RDDs, where we'll practically create and perform all possible operations on RDDs, and finally I'll wind up this session with an interesting Pokémon use case, which will help you understand RDDs in a much better way.
Let's get started. Spark is one of the top mandatory skills required by each and every big data developer. It is used in multiple applications which need real-time processing, such as Google's recommendation engine, credit card fraud detection and many more. To understand this in depth, we shall consider Amazon's recommendation engine. Assume that you are searching for a mobile phone on Amazon and you have certain specifications of your choice. Then the Amazon search engine understands your requirements and provides you the products which match the specifications of your choice. All this is made possible because of the most powerful tool existing in the big data environment, which is none other than Apache Spark, and resilient distributed datasets are considered to be the heart of Apache Spark.
So with this, let's begin our first question: why do we need RDDs? Well, the current world is expanding its technology, and artificial intelligence is the face of this evolution. The machine learning algorithms and the data needed to train these computers are huge; the logic behind all these algorithms is very complicated and mostly runs in a distributed and iterative computation method. The machine learning algorithms could not use the older MapReduce programs, because the traditional MapReduce programs needed a stable state in HDFS, and we know that HDFS generates redundancy during intermediate computations, which resulted in a major latency in data processing. And in HDFS, gathering data from multiple processing units at a single instance was time consuming. Along with this, the major issue was that HDFS did not have random read and write ability. So using these old MapReduce programs for machine learning problems would not work well. Then Spark was introduced. Compared to MapReduce, Spark is an advanced big data processing framework. The resilient distributed dataset, which is a fundamental and most crucial data structure of Spark, was the one which made it all possible. RDDs are effortless to create, and the mind-blowing property which solved
the problem was its in-memory data processing capability. An RDD is not a distributed file system; instead, it is a distributed collection of memory, where the data needed is always stored and kept available in RAM, and because of this property, the elevation it gave to the memory accessing speed was unbelievable. RDDs are fault tolerant, and this property took their dependability to a whole new level. So our next question would be: what are RDDs? The resilient distributed datasets, or the RDDs, are the primary underlying data structures of Spark. They are highly fault tolerant and they store data amongst multiple computers in a network. The data is written onto multiple executor nodes, so that in case of a calamity, if any executing node fails, then within a fraction of a second it gets backed up from the next executor node, with the same processing speed as the current node. The fault-tolerant property enables them to roll back their data to the original state by applying simple transformations onto the lost partition in the lineage.
RDDs do not need anything called a hard disk or any other secondary storage; all that they need is the main memory, which is RAM. Now that we have understood the need for RDDs and what exactly an RDD is, let us see the different sources from which the data can be ingested into an RDD. The data can be loaded from any source like HDFS, HBase, Hive, SQL: you name it, they've got it. Hence the collected data is dropped into an RDD. And guess what, the RDDs are free-spirited: they can process any type of data. They won't care if the data is structured, unstructured or semi-structured. Now, let me walk you through the features of RDDs which give them an edge over the other alternatives. In-memory computation: the idea of in-memory computation brought groundbreaking progress in cluster computing; it increased the processing speed when compared with HDFS. Moving on to lazy evaluation: the phrase lazy explains it all. Spark logs all the transformations you apply onto it and will not throw any output onto the display until an action is provoked. Next is fault tolerance: RDDs are absolutely fault tolerant.
Any lost partition of an RDD can be rolled back by applying simple transformations onto the lost partition in the lineage. Speaking about immutability: the data once dropped into an RDD is immutable, because the access provided by an RDD is just read-only. The only way to access or modify it is by applying a transformation onto an RDD which is prior to the present one. Discussing partitioning: the important reason for Spark's parallel processing is its partitioning. By default, Spark determines the number of partitions into which your data is divided, but you can override this and decide the number of blocks you want to split your data into. Let's see what persistence is. Spark's RDDs are totally reusable: the users can apply a certain number of transformations onto an RDD and preserve the final RDD for future use. This avoids all the hectic process of applying all the transformations from scratch. And now, last but not the least, coarse-grained operations.
The operations performed on RDDs using transformations like map, filter, flatMap, etc. change the RDDs and update them. Hence, every operation applied onto an RDD is coarse-grained. These are the features of RDDs, and moving on to the next stage, we shall understand the creation of RDDs. RDDs can be created using three methods. The first method is using parallelized collections. The next method is by using external storage like HDFS, HBase, Hive and many more. The third one is using an existing RDD which is prior to the present one. Now, let us see, understand and create an RDD through each method. Now, Spark can be run on virtual machines like the Spark VM, or you can install a Linux operating system like Ubuntu and run it standalone, but we here at Edureka use the best-in-class cloud lab, which comprises all the frameworks you need in a single-stop cloud framework. No need of any hectic process of downloading any file or setting up environment variables and looking for a hardware specification, etc. All you need is a login ID and password to the all-in-one, ready-to-use cloud lab where you can run and save all your programs.
Let us fire up our Spark shell using the spark-shell command. Now that the Spark shell has been fired up, let's create a new RDD. So here we are creating a new RDD with the first method, which is using parallelized collections. We are creating a new RDD by the name parallelizedCollectionsRDD; we are starting a Spark context and we are parallelizing an array into the RDD, which consists of the data of the days of a week: Monday, Tuesday, Wednesday, Thursday, Friday and Saturday. Now, let's create this new RDD. The parallelizedCollectionsRDD is successfully created. Now, let's display the data which is present in our RDD. So this was the data which is present in our RDD. Now, let's create a new RDD using the second method. The second method of creating an RDD was using an external storage such as HDFS, Hive, SQL and many more. Here I'm creating a new RDD by the name sparkFile, where I'll be loading a text document into the RDD from an external storage, which is HDFS.
And this is the location where my text file is located. So the new RDD sparkFile is successfully created. Now, let's display the data which is present in the sparkFile RDD. The data which is present in the sparkFile RDD is a collection of alphabets starting from A to Z. Now, let's create a new RDD using the third method, which is using an existing RDD which is prior to the present one. In the third method, I'm creating a new RDD by the name words, and I'm creating a Spark context and parallelizing a statement into the RDD words, which is "Spark is a very powerful language". So this is a collection of words which I have passed into the new RDD words. Now, let us apply a transformation onto the RDD and create a new RDD through that. So here I'm applying the map transformation onto the previous RDD, that is words, and I'm storing the data into the new RDD, which is wordPair.
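The map step just applied can be sketched on a plain collection; this is a local analogue of the wordPair RDD, pairing each word with its first letter.

```scala
// Local sketch of the map transformation used to build wordPair:
// pair each word with its starting letter.
val words    = "Spark is a very powerful language".split(" ")
val wordPair = words.map(w => (w.charAt(0), w)) // (first letter, word)
```

On the cluster the same `.map(...)` call is applied to the words RDD instead of a local array.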
So here we are applying the map transformation in order to display the first letter of each and every word which is stored in the RDD words. Now, let's continue. The transformation has been applied successfully. Now, let's display the contents which are present in the new RDD, which is wordPair. So as explained, we have displayed the starting letter of each and every word: S is the starting letter of Spark, i is the starting letter of is, and so on; l is the starting letter of language. Now, we have understood the creation of RDDs. Let us move on to the next stage, where we'll understand the operations that are performed on RDDs. Transformations and actions are the two major operations that are performed on RDDs. Let us understand what transformations are.
We apply transformations in order to access, filter and modify the data which is present in an RDD. Now, transformations are further divided into two types: narrow transformations and wide transformations. Let us understand what narrow transformations are. We apply narrow transformations onto a single partition of the parent RDD, because the data required to process the RDD is available on a single partition of the parent RDD. The examples for narrow transformations are map, filter, flatMap and mapPartitions. Let us move on to the next type of transformations, which is wide transformations. We apply wide transformations onto multiple partitions of the parent RDD, because the data required to process an RDD is available on multiple partitions of the parent RDD. The examples for wide transformations are reduceByKey and union. Now, let us move on to the next part, which is actions. Actions, on the other hand, are considered to be the next part of operations, which are used to display the final result.
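The narrow-versus-wide distinction can be sketched on a local collection: map and filter touch one element at a time (narrow), while groupBy regroups data across the whole collection, the way a wide transformation shuffles data across partitions. This is an illustrative analogue, not cluster behavior.

```scala
// Narrow vs wide, sketched locally on invented data.
val data   = Seq(1, 2, 3, 4, 5, 6)
val narrow = data.map(_ * 10).filter(_ > 20) // element-wise, no regrouping needed
val wide   = data.groupBy(_ % 2)             // regroups everything by key, like a shuffle
```

In Spark, the narrow chain runs partition-by-partition with no data movement, while a grouping operation like reduceByKey needs records with the same key brought together from multiple partitions.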
The examples for actions are collect, count, take and first. Till now we have discussed the theory part of RDDs; let us start executing the operations that are performed on RDDs. In the practical part we'll be dealing with an example of IPL match data. So here I have a CSV file which has the IPL match records, and this CSV file is stored in my HDFS, and I'm loading my matches.csv file into the new RDD, which is CKfile, as a text file. So the matches.csv file has been successfully loaded as a text file into the new RDD, which is CKfile. Now, let us display the data which is present in our CKfile using an action command. So collect is the action command which I'm using in order to display the data which is present in my CKfile RDD. So here we have in total 636 rows of data, which consist of IPL match records from the year 2008 to 2017. Now, let us see the schema of the CSV file. I am using the action command first in order to display the schema of the matches.csv file.
So this command will display the starting line of the CSV file we have. So the schema of the CSV file is the ID of the match, season, city where the IPL match was conducted, date of the match, team one, team two and so on. Now, let's perform further operations on the CSV file. Moving on to the further operations, I'm about to split the second column of my CSV file, which consists of the information regarding the cities which conducted the IPL matches. So I am using this operation in order to display the cities where the matches were conducted. The transformation has been successfully applied, and the data has been stored into the new RDD, which is states. Now, let's display the data which is stored in our states RDD using the collect action command. So these are the cities where the matches were being conducted. Now, let's find out the city which conducted the maximum number of IPL matches. Here I'm creating a new RDD again, which is statesCount, and I'm using the map transformation, and I am counting each and every city and the number of matches conducted in that particular city.
The transformation is successfully applied, and the data has been stored into the statesCount RDD. Now, let us create a new RDD by the name stateCountM and apply the reduceByKey and map transformations together, and consider tuple one as the city name and tuple two as the number of matches which were conducted in that particular city, and apply the sortByKey transformation to find out the city which conducted the maximum number of IPL matches. The transformations are successfully applied, and the data is stored into the stateCountM RDD. Now, let's display the data which is present in stateCountM. Here I am using the take action command in order to take the top 10 results which are stored in the stateCountM RDD. So according to the results, we have Mumbai, which conducted the maximum number of IPL matches, which is 85, from the year 2008 to the year 2017.
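The count-and-sort pattern used here (map to pairs, reduceByKey, then sort descending) can be sketched on a small hypothetical list of host cities; the real data comes from matches.csv, so the names and counts below are invented for illustration.

```scala
// Local sketch of the map -> reduceByKey -> sort pipeline on invented cities.
val cities = Seq("Mumbai", "Delhi", "Mumbai", "Kolkata", "Mumbai", "Delhi")
val topCities = cities
  .map(city => (city, 1))                     // (city, 1) pairs
  .groupBy(_._1)                              // stand-in for reduceByKey(_ + _)
  .map { case (city, ps) => (city, ps.size) } // matches conducted per city
  .toSeq
  .sortBy(-_._2)                              // descending by match count
```

Taking the head of the sorted sequence gives the city with the most matches, just as take(10) on the sorted RDD returns the top results.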
Now let us create a new RDD by the name fil RDD and use flatMap in order to filter out the matches which were conducted in the city Hyderabad, and store the remaining data into the fil RDD. The transformation has been successfully applied. Now, let us display the data which is present in our fil RDD, which consists of the matches which were conducted excluding the city Hyderabad. So this is the data which is present in our fil RDD, which excludes the matches which were played in the city Hyderabad. Now, let us create another RDD by the name fil and store the data of the matches which were conducted in the year 2017. We shall use the filter transformation for this operation. The transformation has been applied successfully, and the data has been stored into the fil RDD. Now, let us display the data which is present there. We shall use the collect action command, and now we have the data of all the matches which were played, especially in the year 2017.
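The year filter just applied, and the union of two such filtered sets used shortly after, can be sketched on a plain collection of hypothetical (year, city) pairs; the real RDDs hold full rows from matches.csv.

```scala
// Local sketch of filter and union on invented (year, city) match records.
val matches  = Seq((2015, "Pune"), (2016, "Delhi"), (2017, "Mumbai"), (2016, "Kolkata"))
val fil      = matches.filter(_._1 == 2017) // matches played in 2017
val fil2     = matches.filter(_._1 == 2016) // matches played in 2016
val unionRdd = fil ++ fil2                  // like fil.union(fil2) on RDDs
```

The union simply concatenates the two filtered collections, which is exactly what the union transformation does with the two RDDs.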
Similarly, we can find out the matches which were played in the year 2016, and we can save the same data into the new RDD, which is fil2. I have used the filter transformation in order to filter out the data of the matches which were conducted in the year 2016, and I have saved the data into the new RDD, which is fil2. Now, let us understand the union transformation. We will apply the union transformation onto the fil RDD and the fil2 RDD in order to combine the data present in both the RDDs. Here I'm creating a new RDD by the name unionRDD, and I'm applying the union transformation on the two RDDs that we created before. The first one is the fil RDD, which consists of the data of the matches played in the year 2017.
And the second one is fil2, which consists of the data of the matches which were played in the year 2016. Here I'll be clubbing both the RDDs together, and I'll be saving the data into the new RDD, which is unionRDD. Now let us display the data which is present in the new RDD, which is unionRDD. I am using the collect action command in order to display the data. So here we have the data of the matches which were played in the years 2016 and 2017. And now let's continue with our operations and find out the player with the maximum number of man of the match awards. For this operation I am applying the map transformation and splitting out column number 13, which consists of the data of the players who won the man of the match award for that particular match. So the transformation has been successfully applied, column number 13 has been successfully split, and the data has been stored into the manOfTheMatch RDD. Now we are creating a new RDD by the name manOfTheMatchCount; we are applying map transformations onto the previous RDD, and we are counting the number of awards won by each and every particular player.
Now, we shall create a new RDD by the name manOfTheMatch, and we are applying reduceByKey onto the previous RDD, which is manOfTheMatchCount. And again, we are applying the map transformation and considering tuple one as the name of the player and tuple two as the number of matches he played in and won the man of the match award. Let us use the take action command in order to print the data which is stored in our new RDD, which is manOfTheMatch. So according to the result, we have AB de Villiers, who won the maximum number of man of the match awards, which is 15. So these are a few operations that were performed on RDDs. Now, let us move on to our Pokémon use case so that we can understand RDDs in a much better way. So the steps to be performed in the Pokémon use case are: loading the PokemonData.csv file from an external storage into an RDD, removing the schema from the PokemonData.csv file, finding out the total number of water-type Pokémon, and finding the total number of fire-type Pokémon. I know it's getting interesting. So let me explain each and every step practically.
So here I am creating a new RDD by the name PokemonDataRDD1, and I'm loading my CSV file from an external storage, that is my HDFS, as a text file. So the PokemonData.csv file has been successfully loaded into our new RDD. So let us display the data which is present in our PokemonDataRDD1. I am using the collect action command for this. So here we have 721 rows of data of all the types of Pokémon we have. So now let us display the schema of the data. I have used the action command first in order to display the first line of the CSV file, which happens to be the schema of the CSV file. So we have the index of the Pokémon, name of the Pokémon, its type, total points, HP, attack points, defense points, special attack, special defense, speed, generation, and we can also find if a particular Pokémon is legendary or not. Here I'm creating a new RDD, which is noHeader, and I'm using the filter operation in order to remove the schema of the PokemonData.csv file. The schema of the PokemonData.csv file has been removed, because Spark considers the schema as data to be processed.
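Dropping the schema row with a filter can be sketched on a tiny invented stand-in for PokemonData.csv; the idea is simply to keep every line that is not equal to the header.

```scala
// Local sketch of removing the header row, as done for the noHeader RDD.
val csvLines = Seq(
  "Name,Type,Defense",  // header row: Spark would otherwise process it as data
  "Squirtle,Water,65",
  "Shuckle,Bug,230"
)
val header   = csvLines.head
val noHeader = csvLines.filter(_ != header) // keep only the data rows
```

After the filter, only the real records remain, so later parsing of columns never trips over the schema line.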
So for this reason, we remove the schema. Now, let's display the data which is present in the noHeader RDD. I am using the action command collect in order to display the data which is present in the noHeader RDD. So this is the data which is stored in the noHeader RDD, without the schema. So now let us find out the number of partitions into which our noHeader RDD has been split. So I am using the partitions operation in order to find out the number of partitions the data was split into. According to the result, the noHeader RDD has been split into two partitions. I am here creating a new RDD by the name waterRDD, and I'm using the filter transformation in order to find out the water-type Pokémon in our PokemonData.csv file. I'm using the action command collect in order to print the data which is present in the waterRDD. So these are the water-type Pokémon that we have in our PokemonData.csv. Similarly, let's find out the fire-type Pokémon.
I'm creating a new RDD by the name fireRDD and applying the filter operation in order to find out the fire-type Pokémon present in our CSV file. I'm using the collect action command in order to print the data which is present in the fireRDD. So these are the fire-type Pokémon which are present in our PokemonData.csv file. Now, let us count the total number of water-type Pokémon which are present in the PokemonData.csv file. I am using the count action for this, and we have 112 water-type Pokémon present in our PokemonData.csv file. Similarly, let's find out the total number of fire-type Pokémon we have; I'm using the count action command for the same. So we have a total of 52 fire-type Pokémon in our PokemonData.csv file. Let's continue with our further operations, where we'll find out the highest defense strength of a Pokémon.
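The filter-then-count pattern used for the water and fire types can be sketched on a tiny invented sample rather than the full 721-row file.

```scala
// Local sketch of filter + count on an invented (name, type) sample.
val pokemon = Seq(
  ("Squirtle", "Water"), ("Charmander", "Fire"),
  ("Psyduck",  "Water"), ("Bulbasaur",  "Grass")
)
val waterCount = pokemon.count(_._2 == "Water") // like waterRDD.count()
val fireCount  = pokemon.count(_._2 == "Fire")  // like fireRDD.count()
```

On the RDD these are two steps (a filter transformation, then a count action); on a local collection `count` with a predicate does both at once.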
I am creating a new RDD by the name defenseList, and I'm applying the map transformation and splitting out column number six in order to extract the defense points of all the Pokémon present in our PokemonData.csv file. So the data has been stored successfully into the new RDD, which is defenseList. Now I'm using the max action command in order to print out the maximum defense strength out of all the Pokémon. So we have 230 points as the maximum defense strength amongst all the Pokémon. So in our further operations, let's find out the Pokémon which come under the category of having the highest defense strength, which is 230 points. In order to find out the name of the Pokémon with the highest defense strength, I'm creating a new RDD with the name
defenseWithPokemonName, and I'm applying the map transformation onto the previous RDD, which is noHeader, and I'm splitting out column number six, which happens to be the defense strength, in order to extract the data from the particular rows which have the defense strength as 230 points. Now I'm creating a new RDD again, with the name maximumDefensePokemon, and I'm applying the groupByKey transformation in order to display the Pokémon which have the maximum defense points, that is 230 points. So according to the result, we have Steelix, Steelix Mega, Shuckle, Aggron and Aggron Mega as the Pokémon with the highest defense strength, which is 230 points. Now we shall find out the Pokémon which has the least defense strength. So before we find out the Pokémon with the least defense strength, let us find out the least defense points which are present in the defenseList. So in order to find out the Pokémon with the least defense strength, I have created a new RDD by the name minimumDefensePokemon and I have applied the distinct and sortBy transformations onto the defenseList RDD in order to extract the least defense points present in the defenseList, and I have used the take action command in order to display the data which is present in the minimumDefensePokemon RDD.
So according to the results, we have five points as the least defense strength of a particular Pokémon. Now, let us find out the name of the Pokémon which comes under the category of having five points as its defense strength. Let us create a new RDD, which is defenseWithPokemonName2, and apply the map transformation and split column number six and store the data into our new RDD, which is defenseWithPokemonName2. The transformation has been successfully applied, and the data is now stored into the new RDD, defenseWithPokemonName2. The data has been successfully loaded. Now, let us apply the further operations. Here I am creating another RDD with the name minimumDefensePokemon, and I'm applying the groupByKey transformation in order to extract the data from the rows which have the defense points as 5.0.
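The max and min lookups used in these steps can be sketched on an invented column of defense points: max is a single action, while the minimum here is found by deduplicating and sorting, mirroring the distinct and sortBy transformations followed by take.

```scala
// Local sketch of the defenseList column with invented values.
val defenseList = Seq(230.0, 65.0, 5.0, 230.0, 180.0, 5.0)
val maxDefense  = defenseList.max                  // like the max action
val minDefense  = defenseList.distinct.sorted.head // like distinct + sortBy + take(1)
```

Once the extreme value is known, a groupByKey (or a simple filter locally) recovers the rows, and hence the names, that carry that defense strength.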
The data has been successfully loaded, and now let us display the data which is present in the minimumDefensePokemon RDD. According to the results, we have two Pokémon which come under the category of having five points as their defense strength: the Pokémon Chansey and Happiny are the two Pokémon which have the least defense strength. The world of information technology and big data processing started to see multiple potentialities from Spark coming into action. One such pinnacle in Spark's technology advancements is the data frame, and today we shall understand the technicalities of data frames in Spark. A data frame in Spark is all about performance. It is a powerful, multifunctional and integrated data structure, where the programmer can work with different libraries and perform numerous functionalities without breaking a sweat, and understand the APIs and libraries involved in the process. Without wasting any time, let us understand the topic for today's discussion.
I've lined up the docket for understanding data frames in Spark as below, which will begin with what data frames are. Here we will learn what exactly a data frame is, how it looks and what its functionalities are. Then we shall see why we need data frames; here we shall understand the requirements which led us to the invention of data frames. Later I'll walk you through the important features of data frames. Then we shall look into the sources from which the data frames in Spark get their data. Once the theory part is finished, I will get us involved in the practical part, where the creation of a data frame happens to be the first step. Next, we shall work with an interesting example which is related to football, and finally, to understand data frames in Spark in a much better way, we shall work with the most trending topic as our use case, which is none other than Game of Thrones. So let's get started.
What is a data frame? In simple terms, a data frame can be considered as a distributed collection of data. The data is organized under named columns, which provide us the operations to filter, group, process and aggregate the available data. Data frames can also be used with Spark SQL, and we can construct data frames from structured data files, RDDs, or from an external storage like HDFS, Hive, Cassandra, HBase and many more. With this, we shall look into a more simplified example, which will give us a basic description of a data frame. So we shall deal with an employee database, where we have entities and their data types. The name of the employee is the first entity, and its respective data type is the string data type. Similarly, the employee ID has the string data type, the employee phone number is the integer data type, and the employee address happens to be the string data type. And finally, the employee salary is the float data type. All this data is stored into an external storage, which may be HDFS, Hive or Cassandra, using the data frame API, with its respective schema, which consists of the name of the entity along with its data type. Now that we have understood what exactly a data frame is,
let us quickly move on to our next stage, where we shall understand the requirements for a data frame: it provides us multiple programming language support; it has the capacity to work with multiple data sources; it can process both structured and unstructured data; and finally, it is well versed with slicing and dicing the data. So the first one is the support for multiple programming languages. The IT industry required a powerful and integrated data structure which could support multiple programming languages, and at the same time without the requirement of an additional API. The data frame was the one-stop solution which supported multiple languages along with a single API. The most popular languages that a data frame could support are R, Python, Scala, Java and many more. The next requirement was to support multiple data sources.
We all know that in a real-time approach to data processing, we will never end up with a single data source. The DataFrame is one such data structure which has the capability to support and process data from a variety of data sources; Hadoop, Cassandra, JSON files, HBase, and CSV files are a few examples to name. The next requirement was to process structured and unstructured data. The big data environment was designed to store huge amounts of data regardless of which type exactly it is, and Spark's DataFrame is designed in such a way that it can store a huge collection of both structured and unstructured data in a tabular format along with its schema. The last requirement was slicing and dicing data. The humongous amount of data stored in a Spark DataFrame can be sliced and diced using operations like filter, select, groupBy, orderBy, and many more; these operations are applied upon the data, which is stored in the form of rows and columns in a DataFrame. These were a few crucial requirements which led to the invention of DataFrames.
Now let us get into the important features of DataFrames, which give them an edge over the other alternatives: immutability, lazy evaluation, fault tolerance, and distributed memory storage. Let us discuss each feature in detail. The first one is immutability. Similar to resilient distributed datasets (RDDs), the DataFrames in Spark are also immutable. The term immutable depicts that the data stored in a DataFrame will not be altered; the only way to change the data present in a DataFrame is to apply transformation operations on it, which produce a new DataFrame. The next feature is lazy evaluation. Lazy evaluation is the key to the remarkable performance offered by Spark: similar to RDDs, DataFrames in Spark will not produce any output until an action command is encountered.
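Lazy evaluation can be sketched without Spark at all, using plain-Python generators: the "transformations" below only describe a pipeline, and nothing is computed until the final "action" pulls results through it. The names and the trace list are illustrative, not Spark's API.

```python
# Lazy evaluation sketch: building the pipeline does no work;
# the action at the end drives all computation.
trace = []

def numbers():
    for n in range(1, 6):
        trace.append(n)        # record when a value is actually produced
        yield n

# "Transformations": generators are composed, but nothing has run yet.
doubled = (n * 2 for n in numbers())
evens   = (n for n in doubled if n % 4 == 0)
assert trace == []             # no element produced so far

# "Action": summing finally forces evaluation of the whole chain.
result = sum(evens)            # doubled values 2,4,6,8,10 -> keep 4 and 8
print(result)                  # prints 12
```

This mirrors why a Spark `filter` or `select` returns instantly while `show()` or `count()` triggers the actual job.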
The next feature is fault tolerance. There is no way that Spark's DataFrames can lose their data; they follow the principle of being fault tolerant to the unexpected calamities which tend to destroy the available data. The next feature is distributed storage. Spark's DataFrames distribute the data across multiple locations, so that in case of a node failure the next available node can take its place to continue the data processing. The next point is the multiple data sources that the Spark DataFrame can support: the Spark API can integrate itself with multiple programming languages such as Scala, Java, Python, and R, making itself capable of handling a variety of data sources such as Hadoop, Hive, HBase, Cassandra, JSON files, CSV files, MySQL, and many more. So that was the theory part, and now let us move into the practical part, where the creation of a DataFrame happens to be the first step. Before we begin the practical part, let us load the libraries which are required in order to process the data in DataFrames.
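The fault-tolerance idea above rests on lineage: each partition remembers how it was derived, so a lost partition can be recomputed rather than restored from a replica. Here is a stdlib-only sketch of that idea; the function and variable names are illustrative, not Spark's internals.

```python
# Lineage-based recovery sketch: a partition is defined by a recipe
# (every num_parts-th source element, squared), so it can always be rebuilt.
source = list(range(10))

def build_partition(part_id, num_parts=2):
    # The "lineage": how this partition is derived from the source data.
    return [x * x for x in source[part_id::num_parts]]

partitions = {0: build_partition(0), 1: build_partition(1)}

# Simulate a node failure: partition 1 disappears...
del partitions[1]
# ...and is recomputed from its lineage on another node.
partitions[1] = build_partition(1)

print(partitions[1])   # prints [1, 9, 25, 49, 81]
```

The design point: storing a cheap recipe instead of a full copy is what lets Spark tolerate node loss without duplicating all data.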
These are the few libraries which we require before we process the data using our DataFrames. Now that we have loaded all the libraries, let us begin with the creation of our DataFrame. We shall create a new DataFrame with the name employee and load the data of the employees present in an organization; the details of the employees will consist of the first name, the last name, and the mail ID, along with the salary. So the first DataFrame has been successfully created. Now let us design the schema for this DataFrame. The schema is described as shown: the first name is of string data type, and similarly the last name is of string data type along with the mail address; finally, the salary is of integer data type (you could also use float). The schema has been successfully declared, so let us create the DataFrame using the createDataFrame function. Here I am creating a new DataFrame by starting a Spark context, using the createDataFrame method, and loading the data from employee together with the employee schema.
The DataFrame is successfully created; now let's print the data in the DataFrame empDF. I am using the show method here. The data present in empDF has been successfully printed, so let us move on to the next step: working with an example related to a FIFA dataset. The first step in our FIFA example is loading the schema for the CSV file we are working with. The schema has been successfully loaded; now let us load the CSV file from our external storage, which is HDFS, into our DataFrame fifaDF. The CSV file has been successfully loaded into our new DataFrame, so let us print the schema of the DataFrame using the printSchema command. The schema is displayed, and we have the following credentials of each and every player in our CSV file. Now let's move on to further operations on the DataFrame: we will count the total number of player records we have in our CSV file using the count command.
So we have a total of 18,207 players in our CSV file. Now let us find out the details of the columns we are working with. These are the columns, which consist of the ID of the player, the name, age, nationality, potential, and many more. Next, let us take the value column, which holds the worth of each and every player, and use the describe command in order to see the highest and the lowest value assigned to a player. We again get a count of 18,207 players; the minimum worth given to a player is 0 and the maximum is 9 million pounds. Now let us use the select command in order to extract the name and nationality columns, to find out the name of each and every player along with his nationality. Here we can display the top 20 rows, each player in our CSV file along with his nationality. Similarly,
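What `describe` reports for a single numeric column boils down to simple summary statistics. Since the real FIFA file isn't bundled here, this stdlib sketch computes the same count/min/max over a few made-up player values.

```python
# A hand-rolled version of what describe() shows for one column.
# The sample "Value" figures below are illustrative, not the real dataset.
values = [0.0, 1_500_000.0, 9_000_000.0, 750_000.0]

summary = {
    "count": len(values),
    "min":   min(values),
    "max":   max(values),
    "mean":  sum(values) / len(values),
}
print(summary)
```

On the real dataset, this is how the count of 18,207 and the 0-to-9-million range would be derived.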
let us find out the players playing for a particular club. Here we have the top 20 players along with their respective clubs, for example Messi playing for Barcelona, Ronaldo for Juventus, and so on. Now let's move to the next stage: let us find out the players who are most active for a particular national team or club, with age less than 30 years. We shall use the filter transformation to apply this operation. Here we have the details of the players whose age is less than 30 years, with their club and nationality along with their jersey numbers. With this we have finished our FIFA example. Now, to understand DataFrames in a much better way, let us move on to our use case, which is about the hot topic Game of Thrones. Similar to our previous example, let us design the schema of the CSV file first. This is the schema for the CSV file, which holds the data about Game of Thrones.
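The filter transformation used above is just a predicate applied row by row. Here is the same "age less than 30" logic mirrored in plain Python over a few hand-made rows (the names, ages, and clubs are illustrative).

```python
# Row-wise filtering, the logic behind filter(df["age"] < 30).
players = [
    {"name": "Messi",   "age": 31, "club": "Barcelona"},
    {"name": "De Jong", "age": 21, "club": "Ajax"},
    {"name": "Mbappe",  "age": 19, "club": "PSG"},
]

under_30 = [p for p in players if p["age"] < 30]
print([p["name"] for p in under_30])   # prints ['De Jong', 'Mbappe']
```

In Spark the predicate is pushed down and evaluated on each partition in parallel, but the per-row logic is exactly this.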
So this is the schema for our first CSV file. Now let us create the schema for our next CSV file. I have named it schema2 and defined the data types for each and every entity; the schema has been successfully designed for the second CSV file as well. Now let us load our CSV files from our external storage, which is HDFS. The location of the first CSV file, character-deaths.csv, is in HDFS as defined above; the schema is provided as schema, and the header true option is also provided. We are using the spark.read function for this, and we are loading this data into our new DataFrame, the Game of Thrones DataFrame. Similarly, let us load the other CSV file, battles.csv, into another DataFrame, the Game of Thrones battles DataFrame. The CSV files have been successfully loaded; now let us continue with the further operations.
Now let us print the schema of our Game of Thrones DataFrame using the printSchema command. Here we have the schema, which consists of the name, allegiances, death year, book of death, and many more. Similarly, let's print the schema of the Game of Thrones battles DataFrame; this is the schema of that new DataFrame. Now let us display the DataFrame which we have created, using the following command. The DataFrame has been successfully printed, and this is the data which we have in it. We know that there are multiple houses present in the story of Game of Thrones; now let us find each and every individual house present in the story. We use the following command in order to display each and every house present in the Game of Thrones story.
So we have the following houses in the Game of Thrones story. Now let's continue with the further operations. The battles in Game of Thrones were fought for ages; let us classify the wars waged by the year of their occurrence. We shall use the select and filter transformations, and we shall access the columns holding the details of the battle and the year in which it was fought. Let us first find out the battles which were fought in the year 298; the following code contains the filter transformation which will provide the details we are looking for. According to the result, these were the battles fought in the year 298, and we have the details of the attacker kings and the defender kings, the outcome for the attacker, their commanders, and the location where each war was fought. Now let us find the wars waged in the year 299; these are the details of the wars fought in that year. Similarly, let us also find the wars which were waged in the year 300.
So these were the wars fought in the year 300. Now let's move on to the next operations in our use case: the tactics used in the wars waged, and the total number of wars fought using each type of tactic. The following code will help us; here we are using the select and groupBy operations in order to find each and every type of tactic used in the wars. They have used ambush, siege, razing, and pitched-battle tactics, and most of the time the wars used pitched-battle tactics. Now let us continue with the further operations. The ambush type of battles are the deadliest; let us find out the kings who fought battles using this kind of tactic, and also the outcome of the battles fought. The following code will help us extract the data we need: we are using the select and where commands, and we are selecting the columns year, attacker king, defender king, attacker outcome, battle type, attacker commander, and defender commander. Now let us print the details. These were the battles fought using ambush tactics, with the attacker and defender kings along with their respective commanders and the year each war was waged.
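The select-plus-groupBy-plus-count pattern used above is, at its core, frequency counting over one column. A `collections.Counter` over a few illustrative battle types (not the real CSV contents) shows the same computation:

```python
from collections import Counter

# The logic behind df.select("battle_type").groupBy("battle_type").count():
# tally how many rows fall under each distinct value of the column.
battle_types = ["pitched battle", "ambush", "siege",
                "pitched battle", "razing", "pitched battle"]

by_type = Counter(battle_types)
print(by_type.most_common(1))   # the most frequent tactic and its count
```

Spark performs the same tally per partition and then merges the partial counts, which is why groupBy scales across a cluster.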
Let's move on to the next operation. Now let us focus on the houses and extract the deadliest house among the rest; the following code will help us find the deadliest house and the number of battles it waged. Here we have the details of each house and the battles it waged, and according to the results the Stark and Lannister houses are the deadliest among the others. Now let's continue with the rest of the operations. Let us find the deadliest king among the others: we will use the following command in order to find the king who fought the highest number of wars. According to the results, Joffrey is the deadliest king, having fought a total of 14 battles. Now let us continue with the further operations.
Now let us find out the houses which defended the most wars waged against them; the following code will help us find the details. According to the results, the Lannister house defended the most battles waged against it. Next, let us find the defender king who defended the most battles waged against him: according to the result, Robb Stark is the king who defended the most battles. Now let's continue with the further operations. Since the Lannister house is my personal favourite, let me find the details of the characters in the Lannister house. This code will describe their name and gender (1 for male and 0 for female) along with their respective counts. Let me find the male characters in the Lannister house first. Here we have used the select and where commands in order to find the details of the characters present in the Lannister house, and the data is stored into the df1 DataFrame.
Let us print the data present in the df1 DataFrame using the show command. These are the details of the characters present in the Lannister house who are male. Similarly, let us find the female characters present in the Lannister house: these are the characters who are female. So we have a total of 69 male characters and 12 female characters in the Lannister house. Now let us continue with the next operations. At the end of the day, every episode of Game of Thrones had a noble character; let us now find all the noble characters among all the houses that we have in our Game of Thrones CSV file. The following code will help us find these details. The details of all the characters from all the houses who are considered to be noble have been saved into the new DataFrame, df3. Now let us print the details from the df3 DataFrame: these are the top 20 members from all the houses who are considered to be noble, along with their genders.
Now let us count the total number of noble characters from the entire Game of Thrones story: there are a total of 430 noble characters. Nonetheless, we have also seen a few commoners whose role in Game of Thrones is found to be exceptional; let us find the details of all those commoners who were highly dedicated to their roles in each episode. The data of all the commoners has been successfully loaded into the new DataFrame, df4; now let us print the data present in df4 using the show command. These are the top 20 characters identified as commoners across all the Game of Thrones stories. Now let us find the count of the total number of commoner characters: there are a total of 487 commoners across all the stories of Game of Thrones. Let us continue with the further operations.
Now, there were a few roles considered to be important and equally noble, and hence carried through to the last book. Let us filter out those characters and find the details of each one of them. The data of all the characters who are considered noble and were carried through until the last book is stored into a new DataFrame; now let us print the data existing in that DataFrame. According to the results, we have two candidates who are considered noble and whose characters were carried on until the last book. Among all the battles, I found the battles of the last books to generate the most adrenaline in the readers; let us find the details of those battles using the following code, which finds the wars fought in the last years of Game of Thrones. These are the details of the wars fought in those final years, with the details of the kings, their commanders, and the locations where the wars were fought. Welcome to this interesting session, a Spark SQL tutorial from Edureka.
So in today's session we are going to learn how to work with Spark SQL. What can you expect from this particular session? We will first learn why Spark SQL matters, what libraries are present in Spark SQL, and what the important features of Spark SQL are. We will also do some hands-on examples, and in the end we will see an interesting use case of stock market analysis. Now, why Spark SQL? Why are we learning it, why is it really important for us to know, is it really hot in the market, and if yes, then why? We want all those answers from this session. If you are coming from a Hadoop background, you must have heard a lot about Apache Hive. In Apache Hive, SQL developers can write queries in the SQL way, and those queries get converted to MapReduce jobs, which give you the output. Now, we all know that MapReduce is slower in nature.
And since MapReduce is slower in nature, your overall Hive query is definitely also going to be slower. That was one challenge: even if you have, let's say, less than 200 GB of data, or a smaller dataset, Hive performance was not that great. Hive also does not have any resume capability: if a query gets stuck, you can only restart it. Hive cannot even drop encrypted databases; that was also one of the challenges on the security side. Now, Spark SQL has solved almost all of these problems. In the last sessions you have already learned about Spark, right? How Spark is faster than MapReduce; we have already covered that in the previous few sessions.
So in this session we are going to take leverage of all that. In this case, Spark is faster because of in-memory computation. What is in-memory computation? We have already seen it: we compute everything directly in memory. Because of that in-memory computation capability, Apache Spark is faster, and so your Spark SQL is also going to be faster. If I talk about the advantages of Spark SQL over Hive: number one, it is going to be faster in comparison to Hive. A Hive query which takes, let's say, around 10 minutes can finish in Spark SQL in less than one minute. Don't you think that's an awesome capability of Spark SQL? Definitely, right. The second thing: take the example of a company which has been developing Hive queries for, let's say, the last 10 years.
Now, they were doing it, and they were all happy that they were able to process their data, but they were worried about the performance: Hive was not able to give them the level of processing speed they were looking for. So this is a challenge for that particular company. They came to know about Spark SQL, fine; they learned that they can execute everything in Spark SQL and that it is going to be faster as well. But don't you think that if this company has been working in Hive for the past 10 years, they must have already written a lot of code in Hive? If you ask them to migrate to Spark SQL, will it be an easy task? No, right; definitely it is not going to be an easy task. Why? Because even though Hive syntax and Spark SQL syntax both tackle the SQL way of writing things, at the same time there is a good difference between the two syntaxes, so it would take a very good amount of time for that company to change all of their queries over to the Spark SQL way. Here Spark SQL came up with a smart solution: even if you write a query the Hive way, you can execute that Hive query directly through Spark SQL. Don't you think that's again a very important and awesome facility? Because now, if you're a good Hive developer, you need not worry about how you will migrate to Spark SQL.
You can still keep on writing Hive queries, and your query will automatically get converted to Spark SQL. Similarly, in Apache Spark, as we have learned in the past sessions, especially through Spark Streaming, Spark Streaming lets you do real-time processing, right? You can take leverage of this sort of facility in Spark SQL as well: you can do real-time processing and at the same time perform your SQL queries. With Hive, that was the problem: you cannot do that, because Hive on Hadoop is all about batch processing, where you keep historical data and then later you process it. So Hive follows only the batch-processing mode, but Apache Spark also takes care of real-time processing.
So how do all these things happen? Spark SQL uses the metastore services of Hive to query the data stored and managed by Hive. When you were learning about Hive, we learned that in Hive everything we do is always stored in the metastore; that metastore was the crucial point, because using that metastore you were able to do everything, like when you ran any sort of query or created a table, everything got stored in that same metastore. Spark SQL also uses the same metastore: whatever metastore you have created with respect to Hive, you can also use for your Spark SQL. That is something really awesome about Spark SQL: you need not create a new metastore or worry about a new storage space; everything you have done with respect to Hive, the same metastore, you can reuse. Now you may ask me: then how is it faster, if they are using the same metastore? Remember,
it is the processing part. Why was Hive slower? Because of its processing model: it converts everything to MapReduce, and that makes the processing very, very slow. But here, since the processing is going to be in-memory computation, Spark SQL is always going to be faster. The metastore is only used to fetch the metadata; for any other processing-related stage, it is always going to be in memory, and thus faster. So let's talk about some success stories and use cases of Spark SQL: Twitter sentiment analysis. If you remember our Spark Streaming session, we did a Twitter sentiment analysis there, right? There you have seen that we first got the data from Twitter, and we got it with the help of Spark Streaming, and later, what did we do?
We analyzed everything with the help of Spark SQL. So you can see the advantage: in Twitter sentiment analysis, let's say you want to find out about Donald Trump; you fetch every tweet related to Donald Trump and then run an analysis, checking whether each one is a positive tweet, a negative tweet, a neutral tweet, very negative, or very positive. We have already seen the same example in that particular session. The point in this session is that while you are streaming the data in real time, you can also process it using Spark SQL: you are doing all the processing in real time. Similarly, in stock market analysis you can use Spark SQL in a lot of places.
You can adopt it in banking fraud-transaction cases as well. Let's say your credit card is currently getting swiped in India, and in the next 10 minutes your credit card is getting swiped in, let's say, the US. Definitely that is not possible, right? So you do all that processing in real time: you detect everything with Spark Streaming, and then you apply Spark SQL to verify whether it matches the user's trend or not. All those things you can match up with Spark SQL. Similarly, you can use it in the medical domain. Now let's talk about some Spark SQL features. When SQL got combined with Spark, we started calling it Spark SQL. When we talk about SQL, we are talking about either structured data or semi-structured data; SQL queries cannot deal with unstructured data, so that is definitely one thing you need to keep in mind.
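The card-fraud rule just described is a simple predicate once the streamed swipes are available. Here is a tiny plain-Python sketch of that check; the 10-minute window, country codes, and sample swipes are made up for illustration.

```python
from datetime import datetime, timedelta

# Flag two swipes as suspicious if they happen in different countries
# within a short time window (the rule described in the text).
def suspicious(swipe_a, swipe_b, window=timedelta(minutes=10)):
    same_country = swipe_a["country"] == swipe_b["country"]
    close_in_time = abs(swipe_a["time"] - swipe_b["time"]) <= window
    return close_in_time and not same_country

a = {"country": "IN", "time": datetime(2024, 1, 1, 12, 0)}
b = {"country": "US", "time": datetime(2024, 1, 1, 12, 7)}
print(suspicious(a, b))   # prints True: two countries, seven minutes apart
```

In the pipeline the text describes, Spark Streaming would deliver the swipe pairs in real time and a Spark SQL query would apply a rule like this one against the user's historical trend.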
Now, Spark SQL also supports various data formats. You can get data from Parquet; you must have heard of Parquet, a columnar-based storage which keeps the data in a very compressed format, though it is not human-readable. Similarly, you must have heard about JSON, and about Avro, where we keep the data as key-value pairs, and about Hive and Cassandra (a NoSQL database); you can get data from all of these sources. You can also convert your SQL queries into RDDs, so you will be able to perform all the transformation steps. Now, if we talk about performance and scalability: in this graph, the red bars relate to Hadoop and the blue bars denote the performance of Spark SQL, and you can notice that Spark SQL performs much better in comparison to Hadoop. On the y-axis we have the running time,
and on the x-axis we have the number of iterations. A few more Spark SQL features: for example, you can create a connection with a simple JDBC or ODBC driver; these are standard drivers, and you can create your connection to Spark SQL using them. You can also create a user-defined function (UDF): let's say a function is not available to you; in that case you can create your own function. If a function is available, use it; if it is not, you can create a UDF and directly execute that user-defined function to get your results. Here is one example where we show that, if you don't have an uppercase API present in Spark SQL, you can create a simple UDF for it and execute it. Notice what we are doing: this is my data, this dataset is my data part, and I am generating it as a sequence
and creating it as a DataFrame; see this toDF part here. After that we are creating an upper UDF, and notice we are converting any value which comes in to upper case; we are using the toUpperCase API to convert it. We import this function, and then we declare that this is my UDF, named upper, because that is what we created here. So we declare the UDF in the first step, and then, when we use it with our dataset, we pass the text value through the upper UDF we created. It gets converted and gives you all the output in the upper-case way; you can notice that this is the last value and this is the uppercase value, so the value got converted to upper case in this particular example.
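The UDF mechanism above, registering a named function and applying it to a column, can be mimicked without Spark. In this stdlib sketch, a plain dict stands in for Spark's UDF catalogue; the registry and function names are illustrative, not Spark's real API.

```python
# A toy "UDF registry": register a function under a name, then apply it
# to every value of a column, just as Spark applies a UDF row by row.
udfs = {}

def register(name, fn):
    udfs[name] = fn

register("upper", lambda s: s.upper())   # the upper-case UDF from the text

column = ["hello", "spark", "sql"]
result = [udfs["upper"](v) for v in column]
print(result)   # prints ['HELLO', 'SPARK', 'SQL']
```

The real registration call in Spark serves the same purpose: once the function is registered under a name, queries can refer to it by that name.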
Now, notice here the same steps, showing how we can register all of our UDFs, which was not shown before: spark.udf.register. Using this API, you can register your UDFs. Similarly, if you want to get the output after that, you can use the show API. Now for the Spark SQL architecture: what is the Spark SQL architecture? What happens is, let's say you are getting data in various formats: you can get it from CSV, you can get it from JSON format, and you can also get it through JDBC. For this, there is a Data Source API.
Using the Data Source API you fetch the data, and after fetching the data you convert it into a DataFrame. So what is a DataFrame? In the last session we learned what we were doing when creating an RDD: let's say this is my cluster, with one machine, another machine, and another machine. When we created an RDD across the cluster, we were keeping all the values, let's say block B1, in memory across the machines, and we were calling that an RDD. Now, when we work with SQL, we have to store tabular data, right? Let's say there is a table with column details,
say name and age, with some values in each row. Now, if I have to keep this data in my cluster, what do I need to do? First of all, I keep it in memory: one machine holds the name and age column details along with part of the data, and a similar kind of table with some other values sits on another machine, which also carries the column details, name and age, with some more data. If you notice, this sounds similar to an RDD, but it is not exactly like an RDD, because here we are not only keeping the data, we are also storing the column information along with it.
So on all the data nodes, or we can call them worker nodes, we are keeping the column details along with the row data; this thing is called a DataFrame. That is what we are going to do: we convert the data into a DataFrame using the DataFrame API, and then, using the DataFrame DSL or Spark SQL with HiveQL, we process the data and produce the output. We will learn about all these things in detail. Now let's see the Spark SQL libraries. There are multiple APIs available: we have the Data Source API, the DataFrame API, the Interpreter and Optimizer, and the SQL Service; we will explore all of these in detail. Let's talk about the Data Source API first: it is used to read and store structured and semi-structured data into Spark SQL. As you can notice, in Spark SQL we can fetch the data using multiple sources: you can get it from Hive, text files, Cassandra,
CSV, Apache HBase, Oracle DB — so many formats are available, right? So this API is going to help you read all the data and store it wherever you want to use it. After that, the DataFrame API is going to help you convert that data into named columns — and remember, I just explained how you store the data here: you are not keeping it like an RDD, you are keeping the named columns as well as the row data. That is the difference. So in this case we use the DataFrame API to convert the data into named columns and rows. A DataFrame also follows the same properties as an RDD — for example, it is lazily evaluated — all those same properties apply here as well.
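The named-columns-plus-rows idea just described can be sketched in plain Python. This is not Spark code, just a conceptual model: an "RDD" is distributed rows only, while a "DataFrame" pairs the rows with a schema that travels with every partition.

```python
# Plain-Python sketch (not Spark code) of the idea above: an "RDD" is just
# rows, while a "DataFrame" partition also carries the column names.
rdd_partition = [("John", 28), ("Asha", 31)]          # rows only

dataframe_partition = {
    "columns": ["name", "age"],                        # schema travels with the data
    "rows": [("John", 28), ("Asha", 31)],
}

# With the schema attached, a column can be selected by name:
age_index = dataframe_partition["columns"].index("age")
ages = [row[age_index] for row in dataframe_partition["rows"]]
print(ages)  # [28, 31]
```

That ability to look up data by column name is what makes SQL-style processing possible on top of the same in-memory rows.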
Okay, now what happens in the interpreter and optimizer step? Once we have the DataFrame API and have created these named columns, after that we create an RDD, apply our transformation steps, and perform our action step to output the value. All of that happens in the interpreter and optimizer. So those are the pieces so far. Now let's talk about the SQL service. The SQL service is what serves your queries: after the transformations and actions are done, using the Spark SQL service you get your Spark SQL output. So whatever processing you have done in terms of transformations and all of that — you can see that the Spark SQL service is the entry point for working with structured data in Apache Spark.
Okay. So it is going to help you fetch the results from your optimized or interpreted data — that is what it is doing, and this completes the whole diagram. Now let us see how we can perform queries using Spark SQL. First of all, we can go to the spark shell itself and execute everything; you can also execute your program from Eclipse directly. So let's say you log in to your spark shell session. First you need to import SparkSession — in Spark 2.x you must have heard that something called SparkSession came in, and that is what we are importing. In our last session we learned about all of these things; now SparkSession is what we are importing. After that, we create the session using the builder function. Look at this: we use the builder API, then we give the app name, we provide a configuration, and then we call getOrCreate to create the session. Then we import everything else we need, and once that is imported we can proceed.
We want to read this JSON file, employee.json, and in the end we want to output its value. So this df becomes my DataFrame containing the values stored in employee.json — the JSON values get converted into my DataFrame, and at the end we are just outputting the result. Now if you notice what we are doing here: first we import SparkSession — same story, we just execute it — then we build the session and create it, then we read the JSON file using the read.json API. We are reading our employee.json, which is present in this particular directory, and we are outputting it. The file itself is in JSON key-value format, but when I do df.show it displays all my values as a table.
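Conceptually, read.json expects one JSON object per line and turns each into a row. Here is a plain-Python sketch of that idea — the two records are made-up sample data, not the actual employee.json contents:

```python
import json

# Spark's read.json reads one JSON object per line; this sketch mimics that.
json_lines = [
    '{"name": "John", "age": 28}',
    '{"name": "Andrew", "age": 36}',
]

records = [json.loads(line) for line in json_lines]

# A rough equivalent of df.show(): print a header row, then the data rows.
columns = ["name", "age"]
print(" | ".join(columns))
for rec in records:
    print(" | ".join(str(rec[c]) for c in columns))
```

In real Spark the schema (the `columns` list here) is inferred automatically by scanning the JSON keys.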
Now let's see how we can create a Dataset. When we talk about Datasets, you can notice what we are doing: first of all, for a Dataset we create a case class — you can see we are creating a case class Employee. Then we create a sequence, putting in values for the name and age columns, and we display the output of this Dataset. We are also creating a primitive Dataset to demonstrate the mapping of DataFrames to Datasets. You can notice that we are using toDS instead of toDF in this case. Now, you may ask me: what's the difference with respect to a DataFrame? A DataFrame and a Dataset both look exactly the same.
A Dataset will also have the named columns and rows and everything. It was introduced later, in version 1.6 and onward. And what does it provide? It provides an encoder mechanism, so when you are, let's say, reading the data back, you are not doing the deserialization step, which makes it faster. So performance-wise a Dataset is better — that's the reason it was introduced later, and nowadays people are moving from DataFrames to Datasets. Okay. So at the end we are just outputting — see the same thing in the output: we create the Employee case class, put values inside it, create a Dataset, and look at the values. Those are the steps. Now, how can we read a file? To read the file we will use read.json with the Employee type — remember, Employee was the case class which we created earlier.
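The point of the case class is that each row gets a concrete type instead of being an untyped bag of values. A plain-Python sketch of that idea, using a dataclass in place of a Scala case class (the sample rows are made up):

```python
from dataclasses import dataclass

# A Spark Dataset attaches a concrete type (the case class) to each row.
# Here untyped dict rows become typed Employee objects, so fields are
# accessed by name and checked at construction time.
@dataclass
class Employee:
    name: str
    age: int

raw_rows = [{"name": "John", "age": 28}, {"name": "Andrew", "age": 36}]
employees = [Employee(**row) for row in raw_rows]

print(employees[0].name, employees[0].age)  # John 28
```

Spark's encoders go further than this sketch — they serialize typed objects directly into an efficient binary format, which is where the performance win comes from.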
So we are reading it like that and just outputting the value with show — this way we can see the output. Now let's see how we can add a schema to an RDD. In this case also, you can see we are importing all the required libraries. Then we use SparkContext's textFile to read the data, split it on the comma, map the attributes to the Employee case class — converting the age value to an integer — and then convert it to a DataFrame with toDF. After that we create a temporary view or table, so let's create this temporary view "employee". Then we use spark.sql and pass in our SQL query. You can notice that we are passing the query and accessing this "employee" view.
Now, what is this "employee"? It was the temporary view which we created, because the catch in Spark SQL is that when you want to execute any SQL query, you cannot say select * from the DataFrame — that is not even supported. So instead, we need to create a temporary table or a temporary view. You can notice here we are using createOrReplaceTempView — "replace" because if it already exists we override it. So now we have a temporary table which is exactly similar to our DataFrame, and you can directly execute all your queries on that temporary view or temporary table. So notice that instead of using employeeDF, which was our DataFrame, I am using the temporary view here. Okay, then at the end we are just mapping the names and ages and outputting the values.
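This "register a name first, then query it" requirement can be illustrated with Python's built-in sqlite3 module standing in for Spark SQL — a conceptual sketch, not Spark code, with made-up sample rows:

```python
import sqlite3

# Spark SQL queries run against a registered name (a temp view), not against
# a DataFrame object directly. sqlite3 shows the same idea: the data must be
# registered under a table name before SQL can reference it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("John", 28), ("Andrew", 36), ("Maria", 22)],
)

# Now 'employee' is usable in SQL, much like after createOrReplaceTempView.
rows = conn.execute(
    "SELECT name FROM employee WHERE age BETWEEN 18 AND 30 ORDER BY name"
).fetchall()
print(rows)  # [('John',), ('Maria',)]
```

The difference in Spark is that the temp view is just a name bound to the DataFrame's logical plan — no data is copied the way sqlite copies rows into a table.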
That's it. This is just the execution part of it — we are showing all the steps here, and at the end we output all the values. Now, continuing with adding the schema to an RDD, let's see the transformation step. In this case you can notice that we map over the youngstersDF, converting the name into a string for the transformation part. So we are checking that this is a string-type name, and we are displaying the value. Then we use the map encoder from the implicits class, which is available to us, to map the name and age columns — because remember, in the Employee case class we have the name and age columns that we want to map.
Now in this case, we are mapping the names to the ages. Has so you can notice that we are doing for ages of our younger CF data frame that what we have created earlier and the result is an array. So the result but you're going to get will be an array with the name map to your respective ages. You can see this output here so you can see that this is getting map. Right. So we are getting seeing this output like name is John it is 28 that is what we are talking about. So here in this case, you can notice that it was representing like this in this case. The output is coming out in this particular format now, let's talk about how Can add the schema how we can read the file we can add a whiskey minor so we will be first of all importing the type class into your passion. So with this is what we have done by using import statement. Then we are going to import the row class into this partial. So rho will be used in mapping our DB schema. Right? So you can notice we're importing this also then we are creating an rdd called as employ a DD.
In this case you can notice we are creating that RDD with the help of textFile. Once we have created it, we are going to define our schema. This is the schema string: we define it as "name", then a space, then "age", because those are the two fields we have in our data — if you look at employee.txt, those are the two fields, name and age. Once we have done that, we split the schema string on the space, map over the field names, and pass each one into a StructField. So we are defining a variable called fields — an array built by mapping each field name of the schema string into a StructField.
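A plain-Python sketch of that schema-string step, modelling a StructField as a simple (name, type, nullable) tuple rather than using any Spark class:

```python
# Spark splits the schema string "name age" and wraps each field name in a
# StructField(fieldName, StringType, nullable=true). We model a field as a
# (name, type, nullable) tuple and the StructType as a dict.
schema_string = "name age"

fields = [(field_name, "string", True) for field_name in schema_string.split(" ")]
schema = {"type": "struct", "fields": fields}   # stands in for StructType(fields)

print([f[0] for f in schema["fields"]])  # ['name', 'age']
```

The real StructType is applied to a Row RDD with createDataFrame, which is exactly the step described next.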
Okay, so that is what we are doing with the schema string. Then, at the end, we obtain these fields which we created and turn them into a schema: we pass them into a StructType and it gets converted into our schema. You can see all of this being executed with the same steps in the terminal. Now let's see how we transform the results. We have to create an RDD called rowRDD, so let's create that: we transform the employeeRDD using the map function into an RDD of Rows. In this case, look at what we are doing: we take employeeRDD, split each line on the comma, and then — remember we have name and then age — we map attributes(0) and attributes(1), and we trim attributes(1) to ensure there are no stray spaces, because we don't want to keep those unnecessarily.
That's the reason we add the trim call. Once we are done with this, we define a DataFrame, employeeDF, by applying the schema to the RDD. So notice the rowRDD which we defined here, and the schema which we defined in the last step — the schema we created from our fields. We pass both of them in and say that we are creating a DataFrame; this is what gives us the DataFrame. Now we can create a temporary view on top of employeeDF — let's create an "employee" temporary view — and then we can execute any SQL queries on top of it.
So as you can see, with Spark SQL we can write our SQL queries and execute them directly. Now, if we want to output the values we can quickly do that: say we want to display the names — attributes(0) contains the name — we can use the show command. So this is how we perform the operations in the programmatic-schema way. This is the same output here; we are just executing the whole thing, and you can notice we are using attributes(0) to represent the output. Now let's talk about JSON data — how we can load the files and work on them. In this case we first import our libraries, and after that we can just call read.json to bring up our employee.json; you can see the execution of that part. Similarly, we can also write the data back in parquet format, or read values from parquet.
You can notice that if I want to write this employeeDF DataFrame as parquet, I can call write.parquet. This will create employee.parquet, and all the values will be converted into it. One thing about this data: if you go and look in that particular location, employee.parquet is actually a directory that gets created, and you will notice that you cannot read the data inside, because it is not human-readable. Now, let's say you want to read it back: you can bring it back using read.parquet — you read the employee.parquet which was just created, then you create a temporary view or temporary table, and then using standard SQL you can query your temporary table.
In this way you can read your parquet file data, and at the end we just display the result — see the similar output here. Okay, this is how we can execute all these things. Now, once we have done all this, let's see how we can create our DataFrames from a file path. Let's say we have this file employee.json; we create a DataFrame from the JSON path using read.json. Then we can print the schema — printSchema is going to print the schema of my employee DataFrame. Then we create a temporary view of this DataFrame — see, createOrReplaceTempView, which we have also seen last time. After that we can execute our SQL query — let's say select from employee where age is between 18 and 30.
We can run that kind of SQL query and see the output. In this execution you can see that all the employees whose age is between 18 and 30 show up in the output. Now let's see the RDD operation way. We are going to create an RDD, otherEmployeeRDD, which stores the content for an employee named George from New Delhi. See this part: we create it using makeRDD, and it stores a JSON string describing George — New Delhi is the city name and Delhi is the state. That is what we pass inside it. Then we assign the content of this otherEmployeeRDD into otherEmployee.
We use spark.read.json to read the value, and at the end we use show here — you can notice the output coming up. Now let's see the Hive table. If you want to read from a Hive table, let's do it with a case class and SparkSession. First of all we import the Row class and SparkSession into the spark shell. After that we create a case class Record containing a key, which is of integer type, and a value, which is of string type. Then we set the location of the warehouse — warehouseLocation — to this path. Next we build a SparkSession to demonstrate the Hive example in Spark SQL. Look at this: we call SparkSession.builder again, we pass the app name to it, we pass the configuration, and then we say that we want to enable Hive support. Once we have done that, we import the spark.sql library.
Then you can notice that we can use sql directly: we can create a table src — CREATE TABLE IF NOT EXISTS src — with columns that store the data as key-value pairs. That is what we are doing here, and you can see the execution of the same step. Now let's see the SQL operations happening here. In this case we load the data from the examples directory — this kv1.txt file, which is available to us — and store it into the table src which we just created. And if you want to view the output, it becomes a simple SELECT * FROM src, and it shows all the values — you can see this output. Okay, so this is the way you can display the values. Similarly, we can perform the count operation.
Okay, so we can say SELECT COUNT(*) FROM src to count the number of keys in the src table, and then select the records themselves — SELECT key, value — so you can see that we can perform all our Hive operations here. Similarly, we can create a Dataset, stringsDS, from sqlDF: using the sqlDF we already have, we just call map, provide the case class to map the key-value pairs, and at the end we show all the values — see the execution and then the output which we wanted. Now let's see the result part. We can create a DataFrame here, recordsDF, and store results containing the values between 1 and 100. Then we create a temporary view for the records, so that we can run all our SQL queries over it. Now we can execute any query — and you can also notice that we are doing a join operation here.
Okay, so we can display the content of a join between the records table and src — we can do a join on this part, so we can also perform all the join operations and get the output. Now let's see a use case for this. We are going to analyze the stock market with the help of Spark SQL, so let's understand the problem statement first. Everybody must be aware of the stock market: a lot of activity happens there, and you want to analyze it in order to make some profit out of it, and so on. So let's say our company has collected a lot of data for ten different companies and wants to do some computation. Say they want to compute the average closing price, list the companies with the highest closing prices, and compute the average closing price per month.
They want to list the number of big price rises and falls and compute some statistical correlation. These are the things we are going to do with the help of our Spark SQL statements. This is a very common scenario: we want to process huge data, handle input from multiple sources, process the data in real time, and it should be easy to use — not very complicated. All of these requirements are handled by Spark SQL, and that's the reason we are going to use it. So as I said, we are going to use ten companies, and on those ten companies we are going to perform our analysis. We will be using the table data from Yahoo Finance for these stocks — companies like AAON, ABAX, FAST, MSFT, and so on. This is how the data looks: it has the date, opening rate, high rate, low rate, closing rate, volume, and adjusted close.
All this data will be present. Now let's see how we can implement stock analysis using Spark SQL. This is how the data-flow diagram will look: initially we have a huge amount of real-time stock data, which we process through Spark SQL. We convert it into named-column form, then we create an RDD for functional programming. Then we use our Spark SQL queries, which will calculate the average closing price per year and the company with the highest closing price per year, and through these Spark SQL queries we get our outputs. Okay, so that is what we are going to do — and it's not only these; we are also going to compute a few other queries, which we will solve as well.
We are going to execute them. Now, this is how the flow will look: initially we have this data which I have just shown you; we create a DataFrame; we then create a joined-close RDD — we will see what we do there; then we calculate the average closing price per year; we hit our Spark SQL query and get the result in a table. This is how my execution will look. First of all, we initialize Spark SQL: we import all the required libraries, then we start our SparkSession. After importing the required libraries, we create the case class we need — you can notice it here — and then we define our parsed-stocks schema, because we have already learned how to create a schema; we create this table schema the same way.
Then we define our parsedRDD. If you notice, we create this parsedRDD, and first of all we remove the header line from it. Then we read our CSV file into the stocksAAONDF DataFrame — you can see we read the CSV file and convert it into a DataFrame, passing it through the RDD. Once we are done, if we want to print the output, we can do it with the help of the show API. After that, say we want to display the average closing price for AAON for every month — we can do all of that using a select query: we call select on the DataFrame and pass whatever parameters are required to get the average. You can notice that inside this we are creating aliases as well.
For this date column we are creating an alias, right, and we are showing the output as well. Next we will be checking where the closing price for Microsoft went up by 2 or more — wherever the difference is greater than 2, we want to get and display the result. So wherever it is greater than 2, we get the value, and we hit a SQL query to do that. You can notice the SQL query which we are hitting on stocksMSFT — the DataFrame we created. On that, we put our query with the condition: where the difference between my closing price and my opening price — say at close the stock price was 100 US dollars and in the morning it opened at, say, 98 — wherever that difference is 2 or greater, that is the output we want. That is what we are doing here.
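The close-minus-open filter just described can be sketched in plain Python — the rows here are made-up (date, open, close) samples, not real MSFT prices:

```python
# Sketch of the "close minus open greater than 2" query above.
# Each row is (date, open, close); the values are invented samples.
stocks_msft = [
    ("2016-01-04", 98.0, 100.5),
    ("2016-01-05", 101.0, 101.8),
    ("2016-01-06", 99.5, 103.0),
]

# Keep only the days where the close beat the open by more than 2 dollars.
big_rises = [row for row in stocks_msft if row[2] - row[1] > 2]
print(big_rises)  # [('2016-01-04', 98.0, 100.5), ('2016-01-06', 99.5, 103.0)]
```

In Spark SQL the same condition goes into the WHERE clause of the query run against the registered stocks view.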
Once we are done with that, we use the join operation: we will be joining the AAON and ABAX stocks in order to compare their closing prices. First we create a union of these stocks, then we display the joined rows. Look at what we do: we use Spark SQL, and if you look closely at this case, we hit the SQL query, we say FROM these tables, and here we use the join operation — see this join — we join on them, and at the end we output it. Here you can see you can do a comparison of the close prices for all these stocks.
You can also include more companies — right now we have just shown an example with two companies, but you can do it for more as well. Now in this case, notice that we are writing this out in the parquet file format and saving it into a particular location: we create this joined-stock parquet output, storing it in parquet format. If you want to read it back, we can read that and show the output. But whatever file you have saved as parquet, you will definitely not be able to read it directly, because parquet files are not human-readable. Next, you will see the average closing price per year.
I am going to show you all these things running as well — right now I am just explaining how things will run, and I will show you the execution too. In this case, notice that we again create our DataFrame here, and again we execute our query on whatever table we have. Because we want to find the average closing price per year, we create a new table containing the average closing price of, say, AAON, ABAX, and FAST, and then display this new table. At the end, we register this table as a temporary table so that we can execute our SQL queries on top of it. So in this case, you can notice that we create this new table, and in it we put our SQL query — the SQL query that finds the average closing price of AAON and the other companies.
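The per-year average being described can be sketched in plain Python with made-up (date, close) rows for a single company:

```python
from collections import defaultdict

# Sketch of the "average closing price per year" aggregation above,
# using invented (date, close) rows. In Spark SQL this is an AVG over
# a GROUP BY on the year extracted from the date.
rows = [
    ("2015-03-02", 20.0),
    ("2015-09-10", 22.0),
    ("2016-01-15", 30.0),
    ("2016-06-20", 34.0),
]

totals = defaultdict(lambda: [0.0, 0])      # year -> [sum of closes, count]
for date, close in rows:
    year = date[:4]
    totals[year][0] += close
    totals[year][1] += 1

avg_by_year = {year: total / count for year, (total, count) in totals.items()}
print(avg_by_year)  # {'2015': 21.0, '2016': 32.0}
```

Spark performs the same sum-and-count per group, just distributed across partitions and merged at the end.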
We then apply the transformation step on this new table, which has the year and the corresponding three companies' data, into a combined company table — you can notice that we create this transformed table and display the output. So you can see that we hit the SQL query and at the end print the output. Similarly, if we want to compute the best average close, we can do that too. Once you have learned the basic stuff, you will notice that everything follows a similar approach. In this case also, we want to find, say, the best of the averages, so we create a best-company table here. It should contain the best average closing price among AAON, ABAX, and FAST, so we can just use the GREATEST function in our query.
We create that, then display the output, and we register it again as a temporary table. Once we have done that, we can hit our queries — say we want to check the best-performing company per year. What do we do for that? We create the final table in which we compute everything: we perform the join and the other SQL queries in order to compute which company is doing best, and then we display the output. That is what the output shows here; we again store it as a temporary view. And here, the same story with correlation: we will use our statistics library to find the correlation between the AAON and ABAX companies' closing prices. That is what we are going to do now.
Correlation, in the finance and investment industries, is a statistic that measures the degree to which two securities move in relation to each other. The closer the correlation is to 1, the stronger the relationship. It is about how two variables are correlated with each other. For example, your age is correlated to your salary: when you are young you usually earn less, and when you are older you will typically earn more because you are more experienced. Similarly, your salary also depends on your educational qualification, and on the premier institute where you studied — say if you are from an IIT or IIM, your salary will likely be higher than from other campuses. It's a probability — that's what I'm telling you.
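The statistic being described is Pearson correlation, which is what Spark MLlib's Statistics.corr computes by default. A plain-Python sketch on two made-up close-price series:

```python
import math

# Pearson correlation: covariance of the two series divided by the
# product of their standard deviations. Values near 1 mean the series
# move together; near -1, they move in opposite directions.
def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

series1 = [20.0, 21.0, 22.0, 23.0]   # invented closes for one company
series2 = [40.0, 42.0, 44.0, 46.0]   # invented closes for a second company

print(round(pearson(series1, series2), 6))  # 1.0 — the series move in lockstep
```

With real stock data the value lands somewhere between -1 and 1, and that is the number the Statistics.corr call in the use case reports.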
So if I have to correlate, say, education and salary in this case, I can easily compute a correlation — that is what correlation does, and we are going to do all of that on our stock analysis. Now, what do we do in this case? You can notice we create series1, where we hit the select query, map the AAON close prices, and convert them to an RDD. In a similar way we create series2 for ABAX — earlier we did it for the AAON close — and then at the end we use Statistics.corr to compute the correlation between them. You can notice this is how we can execute everything. Now let's go to our VM and see everything in execution. A question from a participant: how will we be getting this VM? You will be getting this VM from Edureka, so you need not worry about that — once you enroll for the courses, you will get the whole VM setup from the Edureka side. "So even if I am working on the Mac operating system, my VM will work?" Yes, every operating system is supported — no trouble, you can use the VM on any operating system. What Edureka does is make sure you are not troubled by any of this setup: they ensure that whatever is required for your practicals is taken care of. That's the reason they have created their own VM, which is also smaller in size in comparison to the Cloudera and Hortonworks VMs, and this is definitely going to be more helpful for you.
So all these things will be provided to you. Another question: will I learn all these projects from the sessions? Yes. Right now we have just gotten an upper-level view of how the session looks, but when we actually teach these things in the course, it is in a much more detailed format: we keep showing you each step in detail, how things work, including the project. So you will also be learning with the help of a project on each different topic; that is the way we go about it. Now, if I am stuck in a project, who will be helping me? There will be a support team, 24/7. If you get stuck at any moment, you just need to give a call or send an email; there is a support ticket system, and the technical team will immediately help you. The support team is available 24/7.
They are all technical people, and they will be assisting you with all of that; even the trainers will be assisting you with any technical query. Great, awesome, thank you. Now, if you notice, this is my data; we were executing all the things on this data. This is the same code which I have just shown you earlier; now let us just execute this code. In order to execute this, we can connect to our spark-shell, so let's get connected. We will go step by step: first we will be importing our packages. This takes some time, let it get connected. Once this is connected, you can notice that I'm importing all the important libraries; we have already learned about those. After that, you will be initializing your SparkSession, so let's do that, the same steps which you have done before. Once we are done,
we will be creating a Stock class. We could also have executed this directly from Eclipse; I just want to show you step by step whatever we have learned. So now you can see, for company one, if you want to do some computation, we even want to see the values and all, right? That's what we're doing here: we are just reading the files and creating another RDD. So let's execute this. Similarly for the rest of the companies; I'm copying all these steps together because there are a lot of companies for which we have to do this step. So let's run it for all the 10 companies which we are going to create. As you can see, printSchema is giving its output right now. Similarly, I can execute it for the rest of them as well.
So this is giving you the output in a similar way; all the outputs will show up here for company four, company five, all these companies, you can see them in execution. After that, we will be creating our temporary view so that we can execute our SQL queries. So let's do it for company one, and after that we can create our temporary table for it. Once we are done, we can run our queries. For example, we can display the average of the adjusted closing price per year for each company, so we can hit this query. All these queries will run on your temporary view, because we cannot run these SQL queries directly on our data frames. So you can see this getting executed; it is printing the output because we have done .show(), and that's the reason you're getting this output.
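The temp-view-plus-SQL pattern can be sketched outside Spark too. Here is a small stand-in using Python's built-in sqlite3 module; the table name, columns, and prices are made up for illustration, and in Spark the same shape of query would run against the temporary view via spark.sql(...).show().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (company TEXT, year INTEGER, close REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?, ?)",
    [("MSFT", 2015, 46.0), ("MSFT", 2015, 50.0),
     ("MSFT", 2016, 55.0), ("AAPL", 2015, 110.0)],
)

# Average closing price per company per year: the same shape of query
# we hit against the Spark temporary view in the demo.
rows = conn.execute(
    "SELECT company, year, AVG(close) FROM stocks "
    "GROUP BY company, year ORDER BY company, year"
).fetchall()
for company, year, avg_close in rows:
    print(company, year, avg_close)
```

The point of registering the temporary view is exactly this: once the data is visible as a table, any aggregation is one declarative query instead of hand-written RDD code.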
Similarly, if we want to, let's say, list the days when the closing price for MSFT went up more than $2, that query we can also execute. We have already understood this query in detail; seeing its execution is so that you can appreciate whatever you have learned. See, this is the output showing up for you. Now after that, how can you join all the stocks' closing prices? Similarly, how can we save the joined view in a Parquet table, read it back, and create a new table from it? Let's execute all these three queries together, because we have already seen them. Look at this: in this case we are doing the join and printing its output, then we want to save it into Parquet files, so we are saving it, and we want to read it back again; then we are creating our new table from that join. That is what we are doing in this case. Then you want to see the output.
Then we are registering it as a temp table again. Now, once we are done with this step as well, so we have done it in step 6, we want to perform, let's say, a transformation on a new table corresponding to the three companies so that we can compare them. We want to create a bestCompany table containing the best average closing price for these three companies; we want to find the company with the best closing-price average per year. So let's do all that as well. You can see bestCompanyOfTheYear; here also we are doing the same stuff, registering our temp table. Okay, so there's a mistake here: if you notice, here it is 1, but here we are doing a show of "all", so there is a mismatch. I'm just correcting it.
So here also it should be 1; I'm just updating it in the sheet itself so that it will start working now. Here I have just made it 1, so after that it will start working. Okay, wherever it says "all" I have to make it 1; that is the change which I need to do here also, and you will notice it will start working. So here also you need to make it 1, in all those places wherever it was. Just a good point to note: wherever you are working on this, you always need to ensure that all these values which you are putting here are consistent. I could have also done it like this, one second. In fact, in this place I need not do this step at all; one second, let me explain you why. See, from here this error started appearing, because my data frame which I have created here must be 1.
Let's execute it. Now you will notice this query started to work; see, this is working. Now after that, I am creating a temp table; the temp table which we are creating is, let's say, companyOne. So this is the temp table which we have created; you can see this company now. In this case, if I keep it as companyOne itself, it is going to work, because here anyway I'm going to use whatever temporary table we have created, right? So now let's execute; you can see it has started to work. Now, further to that, we want to create a correlation between them, so we can do that. See, this is going to give me the correlation between the two column names, which we can see here. So this is the correlation: the closer it is to 1, the better it is; it is near to 1, it is 0.9, which is a big value.
So they are highly correlated, which means they are definitely impacting each other's stock price. So this is all about the project. Now, welcome to this interesting session on Spark Streaming from Edureka. What is Spark Streaming? Is it really important? Definitely yes. Is it really hot? Definitely yes; that's the reason we are learning this technology, and this is one of the most sought-after things in the market; when I say a hot thing, I'm talking in terms of the job market. So let's see what our agenda will be for today. We are going to discuss the Spark ecosystem, where we are going to see what Spark is and how Spark Streaming fits into the Spark ecosystem, why Spark Streaming matters, and then we are going to have an overview of Spark Streaming, getting into the basics of it.
We will learn about DStreams, and also about DStream transformations. We will be learning about caching and persistence, accumulators, broadcast variables, and checkpoints; these are the advanced concepts of Spark Streaming. And then, in the end, we will walk through a use case of Twitter sentiment analysis. Now, what is streaming? Let's understand that. Let me start with an example. Let's say there is a bank, and I'm pretty sure all of you must have used the credit cards and debit cards that banks provide. Now, let's say you have done a transaction from India just now, and within an hour your card is getting swiped in the US. Is it even possible for your card to physically reach the US within an hour? Definitely not. Now, how will the bank realize that it is a fraudulent transaction? Because the bank cannot let that transaction happen.
They need to stop it at the time when the card is getting swiped: either they can block it, or give you a call and ask whether it is a genuine transaction, something of that sort. Now, do you think they will put some person behind the scenes who will be looking at all the transactions and blocking them manually? No. So they require something where the data will be getting streamed, and in real time they should be able to catch it: with the help of some pattern they will do some processing, and if it does not look like a genuine transaction, they will immediately either block it, give you a call, or maybe send an OTP to confirm whether it's a genuine transaction. They will not wait till the next day to complete that check.
Otherwise, nobody is going to trust that card, right? So that is how we work with streaming. Now, it has been said that without stream processing of data, big data is not even possible; in fact, we cannot even talk about the Internet of Things without it. This is a well-known statement in the industry. A lot of companies like YouTube, Netflix, Facebook, Twitter, and Pandora are using Spark Streaming. Now, what is it? We have just seen an example to get an idea about streaming. As I said, with the internet growing, these streaming technologies are becoming more popular day by day.
Streaming is a technique to transfer data so that it can be processed as a steady and continuous stream; that means as and when the data is coming, you are continuously processing it as well. In fact, this real-time streaming is what is driving big data and the Internet of Things. Now, there will be a lot of topics here, like the fundamental unit of streaming, DStreams; we will also be transforming our streams; and in fact, companies are using it with their business intelligence. We will see more details in the further slides, but before that, let's talk about the Spark ecosystem. When we talk about the Spark ecosystem, there are multiple libraries present in it. The first one is Spark SQL: with Spark SQL, a SQL developer can write a query in the SQL way, and it is going to get converted into the Spark way and then give you the output; it is kind of analogous to Hive, but it is going to be faster in comparison to Hive. When we talk about Spark Streaming, that is what we are going to learn: it enables analytical and interactive applications for your live streaming data. Then MLlib, which is mostly for machine learning.
And in fact, the interesting part about MLlib is that it has almost completely replaced Mahout; the core contributors of Mahout have moved towards MLlib, because the performance is really good. Then GraphX. Let me give you an example: everybody must have used Google Maps, right? Now, what do you do in Google Maps? You search for a path: you put your source, you put your destination. When you search, it shows different paths and then provides you an optimal path. How is it providing the optimal path? These things can be done with the help of GraphX; wherever you can model the problem as a graph, we can say we can use GraphX. Then SparkR: this is the package provided for R. R is open source and mostly used by analysts, and the Spark community wanted all the analysts to move towards the Spark environment; that's the reason they have started supporting Spark on R, where all the analysts can now execute their queries using the Spark environment, getting better performance, and can also work on big data. That's all about the ecosystem. Below all of this we have the Spark Core engine: the core engine is the one which defines all the basics of Spark, all the RDD-related stuff, and all of that is defined in your Spark Core engine. Moving further, as we have just discussed, we are now going to discuss Spark Streaming, which is going to enable analytical and interactive applications
for live streaming data. Now, why Spark Streaming? We have just seen that streaming itself is very important; that's the reason we are learning it. But Spark Streaming is so powerful that it is now used by a lot of companies to drive their marketing: they get an idea of what a customer is looking for. In fact, we are going to learn a similar use case, where we are going to use Spark Streaming for Twitter sentiment analysis, which can be used for crisis management, adjusting your products or services, and target marketing; companies around the world are using it this way. That's the reason Spark Streaming is gaining popularity, and because of its performance as well, it is beating other platforms at the moment. Now, moving further.
Let's see Spark Streaming's features. When we talk about Spark Streaming features: it is very easy to scale; you can scale to multiple nodes, which can run even to hundreds of nodes. Speed is going to be very quick, meaning in a very short time you can stream as well as process your data. It is fault tolerant: it makes sure that you're not losing your data. Integration of your batch-time and real-time processing is possible, and it can also be used for business analytics, which is used to track the behavior of your customers. So as you can see, this is super powerful, and we are getting to know so many interesting things about Spark Streaming. Now let's quickly have an overview, so that we can get some basics of Spark Streaming.
So, what is Spark Streaming? As we have just discussed, it is for real-time streaming data. It is a useful addition to the Spark core API. We have already seen that at the base level we have Spark Core in our ecosystem, and on top of that we have Spark Streaming. In fact, Spark Streaming is adding a lot of value to the Spark community, because a lot of people are joining the Spark community just to use Spark Streaming. It's so powerful that everyone wants to come and use it, because all the other existing frameworks are not as good in terms of performance, and in terms of ease of use Spark Streaming is also great: if you compare programming for, let's say, Apache Storm, which is also used for real-time processing, you will notice Spark Streaming is much easier from a developer's point of view. That's the reason a lot of people are showing interest in this domain. It also enables high-throughput and fault-tolerant processing, so that you can stream your data and process all of it. And the fundamental unit of Spark Streaming is going to be the DStream. What is a DStream? Let me explain.
So a DStream is basically a series of RDDs to process the real-time data. What we generally do, if you look at this slide: when you get the data, it is a continuous stream; you divide it into batches of input data, and we are going to call each one a micro-batch; then we are going to get batches of processed data. Though it is real time, it is still a batch in a sense, because you are doing processing on some part of the data at a time, even if it is arriving in real time; that is what we are going to call a micro-batch. Moving further, let's see a few more details. Now, from where can you get all your data? What can your data sources be? We can stream the data from multiple sources: Kafka and Flume for events; databases like HBase and MongoDB; NoSQL stores; Elasticsearch; PostgreSQL; even the Parquet file format; you can get the data from all of these.
Now, after that, you can also do processing with the help of machine learning; you can do the processing with the help of Spark SQL and then produce the output. This is a very strong point: you are bringing the data in using Spark Streaming, but the processing can be done using other frameworks as well, like applying machine learning on the data you're getting in real time, or applying Spark SQL on the data which you're getting in real time. Moving further, this is the overall picture: in Spark Streaming you can get the data from multiple sources like Kafka, Flume, HDFS, Kinesis, or Twitter, bring it into Spark Streaming, do the processing, and store it back to HDFS; maybe you can write it to a database, or you can also publish it to a UI dashboard, like Tableau or AngularJS; a lot of UI dashboards are there in which you can publish your output. Now, let us break it down into more fine-grained steps. We are going to get our input data stream, we are going to feed it into Spark Streaming, and we are going to get the batches of input data; then it executes through the Spark engine.
We are going to get the batches of processed data. We have just seen the same diagram before, so the same explanation applies. Now, breaking it down into a more granular part: we are getting a DStream. A DStream is a sequence of RDDs, multiple sets of RDDs arriving over time. So let's say we are getting an RDD at each interval of time, because now we are getting real-time streamed data. Say right now, in one second, I got some data; in the next second I got more data; and then more data in the second after that. So we collect the data from time 0 to time 1 and we say that we have an RDD at that time interval, and similarly it proceeds with time. Now, the next step is extracting the words from an input stream. If you notice what we are doing here: from a certain point, we start applying our operations, any sort of processing.
So as and when we get the data in a time frame, we start applying operations on it. It can be a flatMap operation, it can be any sort of operation, it can even be a machine-learning operation, and then you are generating the words, in this kind of example. So this is how, as you're seeing, we can view all these parts: first at a very high level, then we went into detail, then into more detail, and finally we have seen how we can process the data over time while we are streaming it. Now, one important point: just like SparkContext is the main entry point for any Spark application, similarly, to work on Spark Streaming you require a StreamingContext. What is that? When you're passing your input data stream, when you are working on the Spark Streaming engine, you have to use your StreamingContext; it is through the StreamingContext that you are going to get the batches of your input data. So the StreamingContext consumes a stream of data in Apache Spark; it registers an input DStream to produce a receiver object.
Now, it is the main entry point, as we discussed: just like SparkContext is the main entry point for a Spark application, similarly your StreamingContext is the entry point for your Spark Streaming application. Does that mean SparkContext is no longer an entry point? No: when you create a StreamingContext, it is dependent on your SparkContext. You will not be able to create a StreamingContext without a SparkContext, so the SparkContext is definitely still required. Spark also provides a number of default implementations of sources, like Twitter, Akka Actor, and ZeroMQ, which are accessible from the context. So it is supporting so many things, right? Now, if you notice what we are doing with the StreamingContext here, this is just to give you an idea of how we can initialize our StreamingContext: we will be importing these two libraries, and after that, can you see, I'm passing the SparkContext, sc, and I am passing "every 1 second", meaning collect the data for every one second; you can increase this number if you want. And then this is your ssc: every one second, whatever happens, I'm going to process it.
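The batch-interval idea can be sketched without Spark at all. Below is a plain-Python toy, not the real StreamingContext API, that chops a stream of timestamped events into one-second micro-batches; this is essentially what the ssc with a 1-second interval does before handing each batch to the Spark engine.

```python
def micro_batches(events, interval=1.0):
    """Group (timestamp, value) events into batches of `interval` seconds."""
    batches = {}
    for ts, value in events:
        batch_id = int(ts // interval)  # which 1-second window this event falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

# Events arriving at 0.2s, 0.7s, 1.1s, 2.5s ...
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]
for batch in micro_batches(events):
    # in a real DStream, each batch here would become one RDD
    print(batch)
```

Increasing `interval` is the analogue of passing a larger batch duration to the StreamingContext: fewer, bigger batches, with more latency per result.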
Now let's go to the DStream topic. DStream: the full form is "discretized stream". It's the basic abstraction provided by your Spark Streaming framework. It is a continuous stream of data, received from your source or produced by processing. The StreamingContext is related to your Spark Streaming, while the SparkContext belongs to your Spark Core; if you remember the ecosystem diagram, in the ecosystem we have that SparkContext. Now, the StreamingContext is built with the help of the SparkContext, and in fact it is only by using the StreamingContext that you will be able to perform your streaming. Just like without a SparkContext you will not be able to execute anything in a Spark application, similarly without a StreamingContext your streaming application will not be able to do anything.
It's just that the StreamingContext is built on top of the SparkContext. Okay. So now, talking about DStreams: a DStream is a continuous stream of data; it is received from a source, or it is the processed data stream generated by transforming the input stream. If you look at this part, internally a DStream can be represented by a continuous series of RDDs; this is important. What we're doing is: every second (remember, last time we saw the example of processing whatever happens every second), whatever data you collect in that second, you perform your operation on it. So the data you're getting here will be your DStream; you can say that all these RDDs together will be your DStream. It is represented by a continuous series of RDDs, and that series keeps growing over time.
Let's say in the first second I collect some data and execute on it; in the second second the same thing is happening; in the third second also it is happening. So if you see this diagram: every second, whatever data is getting collected, we are doing the processing on top of it, and the whole continuous series of RDDs that we are seeing here will be called the DStream. Okay, so this is your DStream. Moving further, now we are going to understand operations on DStreams. Let's say you are doing an operation on a DStream: you are getting the data from time 0 to 1, you apply some operation on it, and whatever output you get, you call it the "words" DStream; that is, the operation you are doing, say a flatMap, produces the words DStream. Similarly, whatever operation you apply, you will accordingly get an output DStream for it as well.
So this is what is happening in this particular example. Now, flatMap. flatMap is an API very similar to map; it kind of flattens your values. Okay, let me explain with an example. Let's say I have the text "Hi this is edureka welcome". Now I want to apply a flatMap. Let's say this text is in the form of an RDD; on this RDD I apply flatMap, so rdd.flatMap; note it is flatMap, not map. And then you define a function for it: let's say you define a variable a, and then a.split, splitting with respect to whitespace. Now, what is going to happen in this case? I'm not showing the exact syntax here, just explaining flatMap to give you an idea of how it works.
It is going to flatten up this text with respect to the split which you have mentioned, meaning it is going to create each word as one element: "Hi" as one element, "this" as one element, "is" as another element, "edureka" as one element, and "welcome" as one element, for example. So this is how your flatMap works: it kind of flattens up your whole file. That is what we are doing in our stream example, and this is how it will work. Now, we have just understood this part; let's understand input DStreams and receivers. Okay, what are these things? The input DStream sources can be of two types: basic sources and advanced sources. In basic sources we can have file systems and socket connections; in advanced sources we can have Kafka, Flume, and Kinesis.
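The map-versus-flatMap difference just described can be sketched in plain Python (this is a conceptual toy, not the actual Spark API): map keeps one output element per input element, while flatMap splices the per-element sub-lists into one flat stream.

```python
lines = ["Hi this is edureka", "welcome"]

# map: one output per input element, so we get a list of lists
mapped = [line.split() for line in lines]

# flatMap: flatten those sub-lists into one stream of words
flat_mapped = [word for line in lines for word in line.split()]

print(mapped)       # [['Hi', 'this', 'is', 'edureka'], ['welcome']]
print(flat_mapped)  # ['Hi', 'this', 'is', 'edureka', 'welcome']
```

In the streaming word-count example, this flattening is exactly the step that turns each micro-batch of lines into a micro-batch of individual words.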
So your input DStreams are the DStreams representing the stream of input data received from streaming sources. Again, there are two types, as we just discussed: the first is basic, and the second is advanced. Let's move further. Now, what about receivers? If you notice here: there are some events coming in, they go to your receiver, and then into the DStream; RDDs are getting created, and we are performing some steps on them. So the receiver sends the data into the DStream, where each batch is going to contain RDDs. That is what your receiver is doing here. Now, moving further: transformations on DStreams. Let's understand that.
What are the transformations available? There are multiple transformations; let's talk about the most popular ones. We have map, flatMap, filter, reduce, and groupBy; so there are multiple transformations available here. It works like this: you are getting your input data, you apply any of these operations, meaning any transformation, and then a new DStream is going to be created. Okay, so that is what is going to happen. So let's explore them one by one. Let's start with map. What happens with map: it is applied over the batches of data. So let's say it maps each value to an output, like x giving the output X and z giving the output Z; in this format, the data is going to get mapped.
Whatever function you define is applied to each batch of input data. So map returns a new DStream by passing each element of the source DStream through a function which you have defined. Then flatMap, which we have just discussed: it is going to flatten up the things. In this case also, if you notice, we are flattening: it is very similar to map, but each input item can be mapped to zero or more output items, and it returns a new DStream by passing each source element through a function. We have just seen an example of flatMap anyway, so it should now be easy for you to see the difference between it and map. Moving further: filter, as the name states, lets you filter out values. Let's say you have huge data and you want to filter out some values; you just want to work with some filtered data.
Maybe you want to remove some part of it; maybe you are putting some logic on it, like "does this line contain this word or not", and in that case you stream only the records matching that particular criterion. This is what we do here, and naturally, most of the time the output is going to be smaller in comparison to your input. Reduce is going to do a kind of aggregation on the values: let's say at the end you want to sum up all the data you have, and that is going to be done with the help of reduce. After that, groupBy: groupBy is going to combine all the common values. As you can see in this example, all the names starting with C got grouped together, all the names starting with J got grouped together, and so on.
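These four transformations all have direct plain-Python analogues. Here is a toy sketch (ordinary lists standing in for the batches of a DStream; the sample words are made up):

```python
from functools import reduce as fold
from itertools import groupby

batch = ["cat", "cow", "jack", "james", "crow"]

mapped   = [w.upper() for w in batch]              # map: one output per input
flat     = [ch for w in ["ab", "cd"] for ch in w]  # flatMap: zero-or-more outputs, flattened
filtered = [w for w in batch if w.startswith("c")] # filter: keep matching records only
total    = fold(lambda a, b: a + b, [1, 2, 3, 4])  # reduce: aggregate to a single value

# groupBy first letter; itertools.groupby needs the input sorted by the key
grouped = {k: list(g) for k, g in groupby(sorted(batch), key=lambda w: w[0])}
print(grouped)  # {'c': ['cat', 'cow', 'crow'], 'j': ['jack', 'james']}
```

In Spark Streaming the same operations run per micro-batch and return a new DStream rather than a plain list, but the element-level semantics are the same.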
Now, what is this stream window? To give you an example of windowing: everybody must know Twitter, right? Let me go to my paint tool. In this example, let's understand how windowing operations work. Let's say in the initial 10 seconds, the tweets are happening this way: #A, #A, #A. Which is the trending hashtag? Definitely #A. Maybe in the next 10 seconds, #A, #B, #B come up, so the trending one here is #B. Now let's say in another 10 seconds, this time #B, #B, #B are happening, so the trending one is #B alone. But now I want to find out which one is trending over the last 30 seconds. If I combine the windows, I can do it easily. This is your windowing operation example: you are not only looking at your current window, but you are also looking at your previous windows along with the current window.
By current window I mean, let's say, a 10-second slot; in this 10-second slot you are doing the operation on #B, #B, #B, so this is the current window. Now you are not only computing with respect to your current window, but you are also considering your previous windows. Now, let's say I ask you: can you give me the output of which hashtag is trending in the last 17 seconds? Will you be able to answer? No. Why? Because you don't have partial information for 7 seconds; you have information for 10, 20, 30, multiples of your window size, but not intermediate values. So keep this in mind: you will be able to perform windowing operations only with respect to multiples of your window size; you cannot take a partial slice when doing windowed operations. Now, let's get back to the slides. It's the same thing shown here: we are not only considering the current window, but we are also considering the previous windows. Next, let's understand the output operations on DStreams. The output operations are going to allow the DStream data to be pushed out to your external systems.
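The trending-hashtag idea over a sliding window can be sketched in plain Python with collections.Counter (again a toy, not the Spark window API; the hashtag counts are made up to mirror the whiteboard example):

```python
from collections import Counter

# one Counter per 10-second micro-batch of hashtags
batches = [
    Counter({"#A": 5}),            # t = 0-10s
    Counter({"#A": 1, "#B": 2}),   # t = 10-20s
    Counter({"#B": 3}),            # t = 20-30s
]

def trending(batches, window_batches):
    """Top hashtag over the last `window_batches` batches."""
    combined = Counter()
    for batch in batches[-window_batches:]:
        combined += batch
    return combined.most_common(1)[0][0]

print(trending(batches, 1))  # latest 10s only        -> #B
print(trending(batches, 3))  # last 30s: #A=6, #B=5   -> #A
```

Note that the window length is expressed in whole batches; asking for "the last 17 seconds" is impossible here for exactly the reason given above: the data only exists in 10-second chunks.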
If you notice here: whatever processing you have done, your output can be stored in multiple places; you can keep it in your file system, you can keep it in your database, you can even keep it in external systems. So you can keep it in multiple places; that is what is being reflected here. Now, if I talk about output operations, these are the ones which are supported: we can print out the values; we can use saveAsTextFiles, which saves into your HDFS (if you want, you can also use it to save into the local file system); you can save it as an object file; you can save it as a Hadoop file; or you can also apply the foreachRDD function. Now, what is the foreachRDD function? Let's see this example. We will not be spending much time on this part in detail here; we teach it in the Edureka sessions, but just to give you an idea:
foreachRDD is a very powerful primitive that allows your data to be sent out to external systems. Using this, you can send it across to your web server, to the external file systems we have just seen, to a database — it can be anything. So using this, you will be able to push the data out to your external systems. Now, let's understand caching and persistence. DStreams also allow developers to cache or persist the stream's data in memory. You can keep your data in memory for a longer time; even after your action is complete, it is not going to be deleted, so you can reuse it as many times as you want, simply by using the persist method. For input streams which receive data over the network (for example via Kafka, Flume, or sockets), the default persistence level is set to replicate the data to two nodes for fault tolerance, so the data is kept in two places. You can see the same thing in this diagram.
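The foreachRDD output pattern described above can be sketched without a cluster: treat the stream as a list of micro-batches and the external system as an in-memory sink. Both stand-ins (`Sink`, `pushAll`) are invented for illustration; the real call would be `dstream.foreachRDD { rdd => ... }`.

```scala
import scala.collection.mutable.ListBuffer

object OutputSketch {
  // Stand-in for an external system (database, web service, file system).
  final class Sink { val received = ListBuffer.empty[String] }

  // For every "micro-batch", push all of its records to the sink -- the shape
  // of dstream.foreachRDD { rdd => rdd.foreachPartition(part => ...) }.
  def pushAll(batches: Seq[Seq[String]], sink: Sink): Unit =
    batches.foreach(batch => batch.foreach(rec => sink.received += rec))
}
```

In real code you would open the connection to the sink inside the per-partition closure, so each worker creates its own connection rather than shipping one from the driver.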
Now let's understand accumulators, broadcast variables, and checkpoints. These are mostly there to help you with the performance side. An accumulator is a variable that is only added to, through an associative and commutative operation. If you're coming from a Hadoop background, if you have done, let's say, MapReduce programming, you must have seen counters; counters help us to debug the program, and you can perform some analysis in the console itself. You can do something similar with accumulators: you can implement your own counters with them, and you can also sum things up with them. And if you want to track them through the UI, you can also do that; as you can see in this UI itself, you can see all your accumulators. Similarly, we have broadcast variables.
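The counter-style use of an accumulator just described can be illustrated in plain Scala. The key property is that the aggregation is associative and commutative, so per-partition results can be combined in any order; in Spark 1.x the real thing would be `val bad = sc.accumulator(0)` incremented inside an action. The record format below (a blank line counts as malformed) is an assumption made for the example.

```scala
object AccumulatorSketch {
  // Count malformed records within one "partition".
  def countBadPerPartition(partition: Seq[String]): Int =
    partition.count(r => r.trim.isEmpty)

  // Combine per-partition counts -- order-independent, because addition
  // is associative and commutative, which is exactly what accumulators require.
  def countBad(partitions: Seq[Seq[String]]): Int =
    partitions.map(countBadPerPartition).sum
}
```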
Broadcast variables allow the programmer to keep a read-only variable cached on all the machines that are available; Spark is, in effect, caching it on every machine. They can be used to give every node a copy of a large input data set in an efficient manner, and Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost. As you can see here, we pass this broadcast value to the Spark context, and it then broadcasts it to all these places; that is how it works in an application. Generally when we teach this in our classes, since these are advanced concepts, we also take you through the practicals; right now I just want to give you an idea of what these things are. When you go through the practicals of all this and see how accumulators work and how a value gets broadcast, things become much clearer.
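A quick sketch of why broadcast variables matter, using a lookup table: every task needs the table, so you ship it to each machine once instead of with every task. The table and codes below are made up for illustration; in Spark the calls would be `sc.broadcast(...)` on the driver and `.value` inside the closure.

```scala
object BroadcastSketch {
  // A small read-only lookup table that every node needs a copy of.
  val countryByCode: Map[String, String] =
    Map("IN" -> "India", "JP" -> "Japan", "US" -> "United States")

  // In Spark: val b = sc.broadcast(countryByCode)
  //           rdd.map(code => b.value.getOrElse(code, "unknown"))
  // so the table travels to each executor once, not with every task.
  def enrich(codes: Seq[String]): Seq[String] =
    codes.map(c => countryByCode.getOrElse(c, "unknown"))
}
```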
I just want everybody to get a high-level overview of these things first. Now, moving further: what are checkpoints? Checkpoints are similar to checkpoints in gaming: they let the application run 24/7 and make it resilient to failures unrelated to the application logic. If you look at this diagram, we are just creating the checkpoint. A metadata checkpoint, as you can see, saves the information which defines the streaming computation, while a data checkpoint saves the generated RDDs to reliable storage; both of these generate the checkpoint. Now, moving forward, we come to our project, where we are going to perform Twitter sentiment analysis. Let's discuss this very important use case of Twitter sentiment analysis.
This is going to be very interesting, because we will do a real-time analysis of Twitter sentiment. There can be a lot of possibilities for this kind of sentiment analysis, but we will be taking something topical, and it's going to be very interesting. Generally, when we do all this in the full course, it is more detailed; right now going very deep is not really possible, but during the Edureka training you will learn all these things. Now, we talked about some use cases of Twitter. As I said, there can be multiple use cases, because there is so much activity on social media; everybody is very active on it these days. You must have noticed that even politicians have started using Twitter, and their tweets are shown on news channels, with people reading positive or negative meanings into whatever a politician posts. And if we talk about sports — say the FIFA World Cup is going on — you will notice Twitter is filled up with a lot of tweets.
So how can we make use of that? How can we do some analysis on top of it? That is what we are going to learn here. There can be multiple sorts of sentiment analysis: it can be done for crisis management, service adjusting, and target marketing. When a new movie is released, the moviemakers can gauge how the movie is going to perform, so they can easily estimate beforehand whether it is going to land in a certain range of profit or not. Interesting, right? Let us explore more. It is also possible in political campaigns: you must have heard that in the U.S., when the presidential election was happening, social media and all this analysis played a major role in winning that election. Similarly, investors may want to predict whether they should invest in a particular company, or a business may want to decide which customers to target for advertisement, because we cannot target everyone; the problem with targeting everyone is that it is very costly. So we want to narrow it down, because there may be a set of people for whom this advertisement will be more effective, which is also cost effective. For reviewing products and services as well, we can do this.
Once you run the application, you will be generating your tweet data; I am going to save it in this particular directory. Once you are done with that, you are going to extract the sentiment, and once you extract it, you're done. Let me show you quickly how it works in our VM. One more interesting thing about Edureka is that you will be getting all this preconfigured in the machines, so you need not worry about where you will get all this, or whether it is very difficult to install. When I was setting up this open-source software myself, it was not working in my operating system; there are so many things like that, and we have generally seen people face issues resolving everything. So we provide this VM with everything preinstalled, whatever will be required for your training. That's the best part of what we provide.
So in this case your Eclipse will already be there; you just need to go to your Eclipse location. Let me show you: you just need to go inside it and double-click on it. You need not go and install Eclipse, and even Spark will already be installed for you. Let us go to our project. This is our project, which is in front of you; this is the project we are going to walk through. You can see that we have first imported all the libraries, then we have set up our authentication, then we have applied the DStream transformations, extracted the sentiment, and then saved the output at the end. These are the things we have done in this program. Now let's execute it. To run this program, it's very simple: go to Run As, and from Run As click on Scala Application.
You will notice that it is executing the program. Let us execute it. I picked the hashtag for Trump, so we are using tweets about Trump; any tweet about Trump tends to be surveyed as negative — Trump is anyway the hot topic for us. Let me make it a little bigger. You will notice a lot of negative tweets coming up. Now I'm stopping it so that I can show you something. Yes, it's filtering the tweets; we have actually written that in the program itself. We have given, at one location, the hashtag, and using that we were asking for tweets about Trump. Now here we are doing the analysis, and it is also going to tell us whether a tweet is positive or negative. In this situation it is mostly giving negative, because tweets about Trump tend not to be positive.
So that's the reason you're finding this one is negative. Similarly, for any other kind of tweet, we should be getting its sentiment. If I keep on moving ahead, we will see multiple negative tweets coming up. So that's how this program runs; this is how we execute our program, and we can also stop it. Even the output results are written to a location. As you can see, if I go to my location here, this is my actual project directory where it is running, so you can just come to this location and here is your output; all your output is getting written there, so you can just take a look. And yes, everything is done by using Spark Streaming. That's what we've seen: we have done all of this with the DStream transformations we discussed. That is one of the awesome parts about this — you can do such powerful things with Spark Streaming. Now, let's analyze the results.
As we have just seen, it is showing whether the tweets are positive or negative. This is where your output is getting stored; as I showed you, the output will appear like this. This is just the raw output; if you print it explicitly, it will also tell you whether a tweet is neutral, positive, or negative. We have done everything with the help of Spark Streaming only. Now, we have done it for Trump; as I just explained, we have put that in the program itself — we have put everything there, and based on that we are getting all this output. We can apply all the sentiment analysis techniques like this, like we have learned. I hope you have found all of this, especially this use case, very useful; it shows you that, yes, it can be done with Spark. Right now we have put "Trump" there, but if you want, you can keep on changing the hashtag, because that's how we are doing it.
Maybe you can code for, let's say, a cricket match that is going on; you can pull the tweets according to that. In that case, instead of Trump, you can put in any player name, or maybe a team name, and you will see all the related tweets coming up. That's how you can play with this. There are multiple examples we can play with, and these use cases can be evolved into multiple other kinds of use cases; you can keep transforming it according to your own needs. So that's it about Spark Streaming, which is what I wanted to discuss; I hope you found it useful. Now, in classification, to give you an example: you must have noticed the spam folder in your email box; I hope everybody has seen the spam folder in their email. When any new email comes in, how does Google decide whether it's a spam email or a normal email? That is an example of classification. Clustering: let's say I go to Google News; when you type something, it groups
all the related news together; that is called clustering. Regression is also a very important technique. In regression, let's say you have a house and you want to sell it, but you have no idea what the optimal price is that you should ask for your house; regression will help you to estimate that. Collaborative filtering you might have seen when you go to the Amazon web page and it shows you a recommendation — "you may want to buy this because you bought that"; this is done with the help of collaborative filtering. Before I move to the project, I want to show you some practical things about how we execute Spark programs. Let me take you to the VM machine, which is provided by Edureka. These machines are provided by Edureka, so you need not worry about where to get the software or what to install; everything is taken care of in them. Once you come to this, you will see a machine like this; let me close this. You will see a blank machine like this.
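Before we jump to the VM, here is the regression example from a moment ago — estimate a price from one feature — reduced to a few lines of plain Scala using ordinary least squares. MLlib does this at scale over RDDs; the formula below is just the textbook one, shown so the idea is concrete.

```scala
object RegressionSketch {
  // Ordinary least squares with a single feature, e.g. house size -> price.
  // Returns (slope, intercept) of the best-fit line.
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.size == ys.size && xs.nonEmpty)
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    val slope = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum /
      xs.map(x => (x - mx) * (x - mx)).sum
    (slope, my - slope * mx)
  }
}
```

Fitting perfectly linear data such as `xs = Seq(1.0, 2.0, 3.0)`, `ys = Seq(2.0, 4.0, 6.0)` recovers slope 2 and intercept 0.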
Let me show you this. So this is how your machine will look. Now, in order to start working, you will open the terminal by clicking on this black option. After that, you can go to your Spark. How can I work with Spark? In order to execute any program in Spark using the Scala programming language, you will enter spark-shell. If you type spark-shell, it will take you to the Scala prompt, where you can write your Spark program using the Scala programming language. Notice this: you can see it is also showing me version 1.5.2; that is the version of your Spark. You can also see here that the Spark context is available as sc; when you get connected to your spark-shell, sc will be available to you by default. Let us get connected; it takes some time.
Now we got connected to the Scala prompt. If I want to come out of it, I will just type exit, and it will let me come out of this prompt. Secondly, I can also write my programs in Python. So what do I do if I want to program in Spark, but with the Python programming language? I will connect with PySpark: I just need to type pyspark in order to get connected to Python. Okay, I'm not getting connected now because I'm not going to need it; I will be explaining everything in Scala itself. But if you want to get connected, you can type pyspark. So let's again get connected to spark-shell. Meanwhile, as this is getting connected, let us create a small file. Currently, if you notice, I don't have any file.
Okay, I already have a.txt. So let's say cat a.txt: I have some data — one, two, three, four, five. This is the data which is with me. Now, let me push this file into HDFS. First, let me check whether it is already available in my HDFS system: hadoop dfs -cat a.txt, just to quickly check if it is already there. There is no such file, so let me first put this file into my HDFS system: hadoop dfs -put a.txt. This will put it in the default location of HDFS. Now, if I want to read it, I can cat it from HDFS; again, I'm assuming that you're aware of these HDFS commands. You can see now this one, two, three, four, five is coming from the Hadoop file system. Now what I want to do is use this file in Spark. How can I do that? Let's come here. In Scala, we do not have data-type keywords as in Java, where we define, say, int a = 10; in Scala we do not use those data types.
Instead, what we do is use the keyword var. If I write var a = 10, it will automatically identify that it is an integer value; notice it tells me that a is of Int type. Now, if I want to update this value to 20, I can do that. But let's say I try to update it to "ABC": this will throw an error, because a is already defined as an integer and you're trying to assign a string value to it; that is the reason you got this error. Similarly, there is one more keyword, called val. Let's say I write val b = 10; it works very similarly, but with one difference: in this case, if I do b = 20, you will see an error. And why this error? Because when you define something as a val, it is a constant; it is not going to be a variable anymore.
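A minimal illustration of the var/val difference just shown on the shell:

```scala
object VarValSketch {
  // `var` declares a reassignable variable; `val` declares a constant.
  // Types are inferred: `a` below is an Int without writing the type.
  def demo(): (Int, Int) = {
    var a = 10
    a = 20          // fine: a is a var
    val b = 10      // `b = 20` here would be a compile error: b is a val
    (a, b)
  }
}
```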
It will be a constant, and that is the reason that if you define something as a val, it will not be updatable; you will not be able to update that value. So this is how, in Scala, you will be doing your programming: var for a variable, val for a constant. Now let's use this for the example we have learned. Let's say I want to create an RDD: var number = sc.textFile(...). Remember this API? We have learned this API already, sc.textFile. Now let me give it this file, a.txt. If I give it this file a.txt, it will create an RDD from this file; it tells me that it created an RDD of String type. Now, if I want to read this data, I will call number.collect.
This will print the values that were available. Now, this line that you are seeing here is coming from memory; it is reading from the RDD, and that is the reason it shows up in this particular manner. So this is how you will be performing this step. The second thing: I told you that Spark can also work on standalone systems, right? What was happening just now is that we executed this with the file in HDFS. Now, if I want to execute this on our local file system, can I do that? Yes, we can still do that. What do you need to do for that? In that case, the difference will come here: instead of giving the file path like that, you will prefix it with the file:// keyword, and after that you give your local file path, for example /home/edureka.
This is a local path, not an HDFS path. So you will be writing file:///home/edureka/a.txt. Now, if you give this, it will load the file into memory, but not from your HDFS; instead, it loads it from your local file system. That is the difference. As you can see, in the second case I am not even using my HDFS. Which means what? Now, can you tell me why this error? This is interesting. Why this error, "input path does not exist"? Because I have given a typo here. Okay, but if you notice, I did not get this error earlier, at the point where I created the RDD. The file does not exist, but still I did not get any error — because of lazy evaluation. Lazy evaluation made sure that even though you have given the wrong path while creating the RDD, nothing is executed at that point. All the output — or, in this case, the error — you receive only when you hit the action, collect. Now, to correct this, I fix the file name, and this time, if I execute it, it will work.
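The lazy evaluation we just saw on the shell can be imitated in plain Scala, because an Iterator is lazy in the same way: `map` only records what to do, and the error from a bad path surfaces only when something forces the result (our stand-in for `collect`). The file names here are illustrative.

```scala
object LazySketch {
  // Builds a "plan" to read a file but does no work yet; the check (and any
  // failure) runs only when the iterator is consumed, e.g. by toList.
  def plan(path: String): Iterator[String] =
    Iterator(path).map { p =>
      if (p != "a.txt") throw new RuntimeException(s"input path does not exist: $p")
      "1 2 3 4 5"
    }
}
```

So `LazySketch.plan("b.txt")` returns immediately without complaint, just like `sc.textFile` with a typo, while `LazySketch.plan("b.txt").toList` throws, just as `collect` did.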
Okay, you can see this output: 1 2 3 4 5. So this time it works. Now we should be clearer about lazy evaluation: even if you give the wrong file name at first, it doesn't matter until an action runs. Suppose I want to use Spark in a production environment, but not on top of Hadoop — is that possible? Yes, you can do that, although usually that's not what is done. If you want to, you can; there are a lot of options, and you can also deploy it on your Amazon clusters, for example. But how will the data be distributed in that case? You will be using some other distribution system. In that case you are not using HDFS; if you deploy Spark just by itself, it will not be able to distribute the data across nodes in that manner.
You will not be able to leverage that redundancy, but you can use something like Amazon S3 for that. Okay, so that is how you will be using this. This is how you will be doing your practicals, and how you will be working with Spark; we walk through all of it during the training, as I told you. So this is how things work. Now, let us see an interesting use case; for that, let us go back to our slides. This is going to be very interesting. Look at this: this use case is earthquake detection using Spark. In Japan, you might have seen that there are so many earthquakes; you may not have seen one yourself, but you must have heard that many earthquakes happen in Japan. Now, how is that problem solved with Spark? I'm just going to give you a glimpse of what kinds of problems we solve in the sessions; we are definitely not going to walk through this in full detail, but you will get an idea of how it works. Just to give you a little bit of background here.
All these projects you will learn at the time of the sessions. So let's see how we will be using Spark here. As everybody must be knowing what an earthquake is: an earthquake is a shaking of the surface of the Earth, caused by events in the tectonic plates. If you're from India, you might have seen that recently there was an earthquake incident which came from Nepal, and even two days back there was another earthquake incident. These earthquakes keep on coming. Now, the very important part is this: let's say the earthquake is a major one, or there is a tsunami, forest fires, or a volcano. It's very important to be able to see that the earthquake is going to come, to predict it beforehand. It should not happen that only at the last moment people get to know that the earthquake has already arrived; it should not happen like that. They should be able to estimate all these things beforehand; they should be able to predict them beforehand.
So this is the system that Japan is already using; this is a real-time use case I am presenting. Japan is already using this Spark technique in order to address the earthquake problem, and we are going to see how they're using it. Okay, now let's see what happens in the Japan earthquake model. Whenever there is an earthquake coming — for example at 2:46 p.m. on March 11, 2011 — the Japan earthquake early warning is triggered. Now, as soon as it was detected, they immediately started sending out alerts: to the lifts, to the factories, to every station, to the TV stations. They immediately told everyone, so that all the students who were in school got the time to go under their desks, and the bullet trains which were running
were stopped — otherwise, once the tracks start shaking with the bullet trains running at very high speed, casualties could occur. They wanted to ensure that there would be no casualties because of that, so all the bullet trains stopped, and all the elevators and lifts which were running stopped, otherwise some incident could happen. Sixty seconds — sixty seconds before the severe shaking — they were able to inform almost everyone. They sent messages, they broadcast on TV; they did all of this immediately, so that the message reached whoever could receive it, and that saved millions of lives. They were able to achieve all of this with the help of Apache Spark. That is the most important point: you can see that everything they are doing, they are doing on a real-time system. They cannot just collect the data and process it later; they did everything as a real-time system. They collected the data, immediately processed it, and as soon as they detected the earthquake, they immediately informed people. In fact, this happened in 2011.
Since then, they have been using it very frequently, because Japan is one of the areas which is very frequently affected by all this. As I said, the main thing is that we should be able to process the data, and — the bigger thing — we should be able to handle data from multiple sources, because data may be coming from multiple, different sources, each of which might be signaling one or another of the events on the basis of which we predict that an earthquake can happen. It should also be very easy to use, because if it is very complicated, then it becomes counterproductive for a user, and you will not be able to solve the problem. And in the end, how to send the alert messages is important too. All those things are taken care of by Spark.
Now, there are two kinds of waves in an earthquake: the number one is the primary wave, and the second is the secondary wave. The primary wave arrives when the earthquake is just about to start: it travels out from the epicenter and signals that the earthquake is going to begin. The secondary wave is more severe and arrives afterwards. What happens with the secondary wave is that when it starts, it can do the maximum damage, because the primary wave is only the initial wave, and the secondary wave comes on top of that. There are more seismological details to this; I'm not going into all of them, but there will be some details with respect to that. Now, what we are going to do using Spark is compute our ROC. So let's go to our machine and see how we will be calculating the ROC, using which we will solve this problem later — and we will be calculating this ROC with the help of Apache Spark. Let us again come back to the machine in order to work on that.
Let's first exit from this console. Once you exit from this console, here is what you are going to do. I have already created this project and kept it here, because we just want to give you an overview. Let me go to my Downloads section; there is a project called "earth2". So this is your project. Initially, you will not be having all these things. If I go to my Downloads, I have the earth2 project; initially, I will not be having this target directory, project directory, or bin directory. We will be using the SBT framework. If you do not know SBT, it is the Scala build tool, which takes care of all your dependencies. It is very similar to Maven; if you already know Maven, this will feel very similar, but at the same time I prefer SBT, because SBT is easier to write, in my opinion. So you will be writing this build file as build.sbt.
So finally you will provide this build.sbt. In this file, you will give the name of your project, the version, the version of Scala you are using, and the dependencies you have, with their versions — for example, for Spark you are using version 1.5.2. You are telling it: whatever I write in my program, if I require anything related to spark-core, go and get it from the Apache repository, download it, and install it; if I require any dependency for a Spark Streaming program for this particular version, 1.5.2, go to this link and fetch it; and similarly for the other Spark modules. You are just declaring them. Once you have done this, you will create a folder structure: you need to create an src folder; after that, you will create a main folder; inside the main folder, you will create another folder called scala; and inside that, you will keep your program.
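A build.sbt along the lines just described might look like this; the project name and the exact module list are illustrative (only the Spark 1.5.2 version comes from the session), not taken from the actual course project:

```scala
// hypothetical build.sbt for the earthquake project
name := "earth2"

version := "1.0"

scalaVersion := "2.10.4"

// Each line says: if my code needs this module, fetch this version
// of it (and its transitive dependencies) from the repositories.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.5.2",
  "org.apache.spark" %% "spark-streaming" % "1.5.2",
  "org.apache.spark" %% "spark-mllib"     % "1.5.2"
)
```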
So now here you will be writing your program; you can see this Scala source file (a .scala file) — let's keep the code itself as a black box for now. You will be writing the code to achieve this problem statement. Now, what we are going to do is come out of this, go to the main project folder, and from there run sbt package. It will start downloading things according to your build.sbt: it will check your program, and whatever dependencies you require — spark-core, spark-streaming, and the rest — it will download and install. We are not going to execute it now, because I've already done it before, and it also takes some time; that's the reason I'm not doing it. Once you have run this package step, you will find that the target directory, the project directory, and the rest get created.
Once you have created this, you will go to your Eclipse; so let me open my Eclipse. This is how your Eclipse will look. Now, I already have this program in front of me, but let me tell you how you will bring this program in. You will go to the Import option, and under Import you will select "Existing Projects into Workspace". Once you do that, you need to select your main project — for example, you need to select this earth2 project which you have created — and click OK. Once you do that, a project directory for earth2 will appear here. Now, what we need to do: go to src/main, and ignore all the other programs; I require only this one Scala file, because this is where I've written my main function. After that, once you reach it, you need to go to Run As, Scala Application, and your Spark code will start to execute. Now, this will return me an ROC value.
Okay, let's see this output. Once it finishes executing, you can see that the area under the ROC curve is this value; this is all computed with the help of our Spark program. Similarly, there are other programs as well that help you to prepare the data and so on; I'm not walking through all of that. Now, let's come back to the slides and see what the next step is, what we will be doing. You can see the ROC getting created; I'm keeping my ROC here. After you have created your ROC, you will be creating your graph. Now, in Japan there is one important thing: Japan is an area already frequently affected by earthquakes, and the trouble is that we should not start sending the alert even for a minor earthquake; I don't want to do all that for minor tremors. In fact, the buildings and the infrastructure are constructed in such a way that if any earthquake below magnitude six comes,
the buildings are designed so that there will be no damage — there will be no damage to them. This is the major thing when you work with the Japan use case: in Japan, below magnitude six they are not even worried, but above six they are worried. For that, there will be a graph simulation, which you can do with Spark as well. Once you generate this graph, you will see that anything which goes above six should immediately trigger the alert. Ignore the programming side for now, because that is what we have just created and shown you in this execution. Now, if you visualize the same result, this is what is happening: this is showing my ROC, but only if the value is going to be greater than six do we raise those alerts — only then do we send the alert to all the people; otherwise, we keep calm. That is the project we generally show in our Spark course. It is not the only project; we also create multiple other projects as well.
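The alerting rule described for the Japan use case boils down to a threshold check. Here is a toy version: the magnitude-six cutoff comes from the discussion above, while treating exactly 6.0 as alert-worthy is my assumption.

```scala
object AlertSketch {
  // Buildings tolerate quakes below magnitude 6, so only stronger events
  // should trigger the broadcast alert.
  def shouldAlert(magnitude: Double): Boolean = magnitude >= 6.0
}
```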
For example, I can create a model just like Walmart does: whatever sales are happening, they are using Apache Spark on that data, and at the end they let you visualize the output of whatever analytics they are doing. So all of that is done with Spark, and all those things we walk through when we do the course sessions; you will pick them up quickly. I know that since you do not yet know all the topics, you are not able to get a hundred percent of the project, but once you know each and every topic you will definitely have a clearer picture of how Spark is handling all these use cases. Graphs are very attractive when it comes to modeling real-world data because they are intuitive, flexible, and the theory supporting them has been maturing for centuries. Welcome, everyone, to today's session on Spark GraphX. So without any further delay, let's look at the agenda first.
We start by understanding the basics of graph theory and the different types of graphs. Then we'll look at the features of GraphX. Further, we will understand what a property graph is and look at various graph operations. Moving ahead, we'll look at different graph processing algorithms, and at last we'll look at a demo where we will try to analyze Ford GoBike data using the PageRank algorithm. Let's move to the first topic. So we'll start with the basics of graphs. Graphs are basically made up of two sets called vertices and edges. The vertices are drawn from some underlying type, and the set can be finite or infinite. Now, each element of the edge set is a pair consisting of two elements from the vertices set. So your first vertex is V1.
Then your vertex is V3, then your vertices are V2 and V4, and your edges are (V1, V3), then (V1, V2), then (V2, V3), and then (V2, V4). So basically, we represent the vertices set by enclosing the names of the vertices in curly braces: we have V1, V2, V3 and V4, and we close the curly braces. To represent the edge set, we use curly braces again, and inside them we specify the two vertices which are joined by each edge. So for this edge we will use (V1, V3), then for this edge we'll use (V1, V2), then for this edge we'll use (V2, V4), and at last for this edge we'll use (V2, V3), and then I will close the curly braces. So this is your vertices set, and this is your edge set. Now, one very important thing: if the edge set contains (u, v), or in our case the edge set contains (V1, V3), then V1 is adjacent to V3.
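The vertex set, edge set and adjacency test described above can be written out directly. This is a minimal Python sketch of the idea (the GraphX library itself is Scala; the vertex names V1..V4 follow the slide):

```python
# Vertex and edge sets for the example graph from the slide.
vertices = {"V1", "V2", "V3", "V4"}
edges = {("V1", "V3"), ("V1", "V2"), ("V2", "V3"), ("V2", "V4")}

def adjacent(u, v, edges):
    """u is adjacent to v if the edge set contains the pair (u, v)."""
    return (u, v) in edges

print(adjacent("V1", "V3", edges))  # True: the edge (V1, V3) exists
```

Note that with plain tuples the pair is ordered; the next section on undirected versus directed graphs makes that distinction explicit.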
Similarly, your V1 is adjacent to V2, then V2 is adjacent to V4, and looking at this you can say V2 is adjacent to V3. Now, let's quickly move ahead and look at different types of graphs. First we have undirected graphs. Basically, in an undirected graph we use straight lines to represent the edges, and the order of the vertices in the edge set does not matter. So undirected graphs are usually drawn using straight lines between the vertices. It is almost similar to the graph which we have seen on the last slide. Similarly, we can again represent the vertices set as {5, 6, 7, 8} and the edge set as {(5, 6), (5, 7), ...}. Now, talking about directed graphs: in a directed graph the order of vertices in the edge set matters, so we use an arrow to represent each edge, as you can see in the image, which was not the case with the undirected graph, where we were using straight lines.
So in a directed graph we use arrows to denote the edges, and the important thing is how the edge set looks: each pair contains the source vertex, which is 5 in this case, and the destination vertex, which is 6 in this case, and this is never the same as (6, 5). You cannot represent this edge as (6, 5), because the direction always matters in a directed graph. Similarly, you can say that 5 is adjacent to 6, but you cannot say that 6 is adjacent to 5. So for this graph the vertices set would be the same, {5, 6, 7, 8}, as in the undirected graph, but in the directed graph your edge set's first tuple would be (5, 6), then your second edge, which is this one, would be (5, 7), and at last this edge would be (7, 8). In the case of an undirected graph you could write this as (8, 7), or you could write this one as (7, 5), but this is not the case with the directed graph: you have to follow the source vertex and the destination vertex to represent the edge. So I hope you guys are clear with undirected and directed graphs. Now let's talk about the vertex-labeled graph.
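The difference between the two edge representations can be modeled concretely. Here is a hedged Python sketch: an undirected edge is an unordered pair (a frozenset), while a directed edge is an ordered tuple, so (6, 5) matches the former but not the latter:

```python
# Undirected: {5,6} and {6,5} are the same edge, so use frozensets.
undirected_edges = {frozenset({5, 6}), frozenset({5, 7}), frozenset({7, 8})}

# Directed: (5,6) and (6,5) are different edges, so use ordered tuples.
directed_edges = {(5, 6), (5, 7), (7, 8)}

print(frozenset({6, 5}) in undirected_edges)  # True: order is irrelevant
print((6, 5) in directed_edges)               # False: direction matters
```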
In a vertex-labeled graph, each vertex is labeled with some data in addition to the data that identifies the vertex; we call the identifying value the vertex ID. So there will be data added to each vertex. Let's say this vertex would be 6 plus a color, so it would be (6, purple); this vertex would be (8, green); then we'll see this one as (7, red), and this one as (5, blue). Now 5, 6, 7 and 8 are vertex IDs, and the additional data which is attached is the color, like blue, purple, green or red. But only the identifying data is present in the pairs of edges, or you can say only the ID of the vertex is present in the edge set. So here the edge set is again similar to your directed graph: the source ID, which is 5, and then the destination ID, which is 6 in this case; then for this case
it is similarly (5, 7), and for this case it is (7, 8), so we are not specifying the additional data which is attached to the vertices, that is, the color; we specify only the identifiers of the vertices, that is, the numbers. But your vertex set would be something like this: this vertex would be (5, blue), then your next vertex would be (6, purple), then your next vertex would be (8, green), and at last your last vertex would be written as (7, red). So basically, when you are specifying the vertices set in a vertex-labeled graph, you attach the additional information in the vertices set, but the edge set is represented the same way as in a directed graph, where you just specify the source vertex identifier and then the destination vertex identifier. Now, I hope that you guys are clear with undirected, directed and vertex-labeled graphs, so let's quickly move forward. Next we have the cyclic graph. A cyclic graph is a directed graph with at least one cycle, and a cycle is a path along the directed edges from a vertex back to itself.
So once you look over here, you can see that from vertex 5 it's moving to vertex 7, then it's moving to vertex 8, then with the arrow it's moving to vertex 6, and then again it's moving back to vertex 5. So there should be at least one cycle in a cyclic graph. There might be a new component, say a vertex 9, which is attached over here; it would still be a cyclic graph, because it has one complete cycle over here. And the important thing to notice is that the arrows should make the cycle, like from 5 to 7, then from 7 to 8, then 8 to 6, and 6 to 5. Now let's say that there is an arrow from 5 to 6 and then there is an arrow from 6 to 8, so we have flipped the arrows. In that situation this is not a cyclic graph, because the arrows are not completing the cycle: once you move from 5 to 7 and then from 7 to 8, you cannot move from 8 to 6, and similarly, once you move from 5 to 6 and then 6 to 8, you cannot move from 8 to 7.
So in that situation it's not a cyclic graph. So let's clear all this: we will represent this cycle as 5, then using the arrows we'll go to 7, then we'll move to 8, then we'll move to 6, and at last we'll come back to 5. Next we have the edge-labeled graph. Basically, an edge-labeled graph is a graph where the edges are associated with labels, and one can indicate this by making the edge set a set of triplets. So, for example, this edge in the edge-labeled graph will be denoted by the source, which is 6, then the destination, which is 7, and then the label of the edge, which is blue; so this edge would be defined as (6, 7, blue). Then similarly for this one: the source vertex is 7, the destination vertex is 8, and the label of the edge is white. Similarly, for this edge it's 5 and 7 and then the label blue-red, and at last for this edge it's 5 and 6 and the label yellow-green. So all these four edges form the edge set for this graph, and the vertices set is almost the same, {5, 6, 7, 8}. Now, to generalize this, I would say (x, y, a), where x is the source vertex, y is the destination vertex and a is the label of the edge. Edge-labeled graphs are usually drawn with the labels written adjacent to the arcs specifying the edges, as you can see.
We have mentioned blue, white and all those labels in addition to the edges. So I hope you guys are clear with the edge-labeled graph, which is nothing but labels attached to each and every edge. Now let's talk about the weighted graph. A weighted graph is an edge-labeled graph where the labels can be operated on by arithmetic operators or comparison operators, like the less-than or greater-than symbols; usually these are integers or floats. The idea is that some edges may be more expensive, and this cost is represented by the edge labels or weights. In short, weighted graphs are a special kind of edge-labeled graph where each edge is attached to a weight, generally an integer or a float, so that we can perform addition, subtraction or other arithmetic operations, or conditional operations like less-than or greater-than comparisons. So we'll again represent this edge as (5, 6) with the weight 3, and similarly we'll represent this edge as (6, 7) with the weight 6, and similarly we represent the other two edges as well.
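The point that weights, unlike arbitrary labels, can be compared and aggregated is easy to show. A small Python sketch (the weights 3 for edge (5, 6) and 6 for edge (6, 7) come from the slide; the other two are illustrative placeholders):

```python
# A weighted directed graph: each edge maps to a numeric weight.
weights = {(5, 6): 3, (6, 7): 6, (5, 7): 9, (7, 8): 2}

# Because weights are numbers, arithmetic and comparisons both work.
expensive = {e for e, w in weights.items() if w > 5}  # comparison operator
total = sum(weights.values())                         # arithmetic operator

print(expensive)
print(total)  # 3 + 6 + 9 + 2 = 20
```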
So I hope that you guys are clear with weighted graphs. Now let's quickly move ahead and look at the directed acyclic graph. So this is a directed acyclic graph, which is basically a directed graph without cycles. As we just discussed in cyclic graphs, here you can see that the directions of the edges do not complete a cycle, right? We can move from 5 to 7, then 7 to 8, but we cannot move from 8 to 6; similarly, we can move from 5 to 6, then 6 to 8, but we cannot move from 8 to 7. So this is not forming a cycle, and these kinds of graphs are known as directed acyclic graphs. They appear as special cases in CS applications all the time, and the vertices set and the edge set are represented the same way as we have seen earlier. Now, talking about the disconnected graph: vertices in a graph do not need to be connected to other vertices. It is perfectly legal for a graph to have disconnected components or even lone vertices without a single connection.
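The cyclic-versus-acyclic distinction discussed above can be tested programmatically with a standard depth-first search. This is a generic sketch, not anything from GraphX, using the two edge sets from the slides (the original cycle 5→7→8→6→5, and the flipped-arrow variant that breaks it):

```python
# Detect whether a directed graph contains a cycle via DFS coloring.
def has_cycle(vertices, edges):
    adj = {v: [] for v in vertices}
    for src, dst in edges:
        adj[src].append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited / on stack / done
    color = {v: WHITE for v in vertices}

    def dfs(v):
        color[v] = GRAY
        for w in adj[v]:
            if color[w] == GRAY:       # back edge: a cycle closes here
                return True
            if color[w] == WHITE and dfs(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and dfs(v) for v in vertices)

print(has_cycle({5, 6, 7, 8}, [(5, 7), (7, 8), (8, 6), (6, 5)]))  # True: cyclic
print(has_cycle({5, 6, 7, 8}, [(5, 7), (7, 8), (5, 6), (6, 8)]))  # False: a DAG
```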
So basically, this is a disconnected graph which has four vertices but no edges. Now let me tell you something important, that is, what sources and sinks are. Let's say we have one arrow from 5 to 6 and one arrow from 5 to 7. Vertices with only in-arrows are called sinks, so 7 and 6 are known as sinks, and vertices with only out-arrows are called sources. As you can see in the image, this 5 only has out-arrows to 6 and 7, so it is called a source. We'll talk about this again in a while, guys, once we are going through the PageRank algorithm. So I hope that you guys know what vertices are, what edges are, how vertices and edges represent a graph, and what the different kinds of graphs are. Let's move to the next topic. So next, let's see what Spark GraphX is. GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new graph abstraction, that is, a directed multigraph with properties attached to each vertex and edge. Now, to support graph computation, GraphX exposes a set of fundamental operators, like finding subgraphs, joining vertices or aggregating messages, as well as an optimized
variant of the Pregel API. In addition, GraphX also provides you a collection of graph algorithms and builders to simplify your Spark analytics tasks. So basically, your GraphX is extending your Spark RDD; then GraphX is providing an abstraction, that is, a directed multigraph with properties attached to each vertex and edge (we'll look at this property graph in a while); then again, GraphX gives you some fundamental operators, and it also provides you some graph algorithms and builders which make your analytics easier. Now, to get started, you first need to import Spark and GraphX into your project. So as you can see, we are importing Spark first, and then we are importing Spark GraphX to get those GraphX functionalities.
And at last we are importing the Spark RDD to use those RDD functionalities in our program. But let me tell you that if you are not using the Spark shell, then you will need a SparkContext in your program. So I hope that you guys are clear with the features of GraphX and the libraries which you need to import in order to use GraphX. So let us quickly move ahead and look at the property graph. Now, a property graph is something as the name suggests: property graphs have properties attached to each vertex and edge. So the property graph is a directed multigraph with user-defined objects attached to each vertex and edge. Now you might be wondering what a directed multigraph is. A directed multigraph is a directed graph with potentially multiple parallel edges sharing the same source and destination vertex. So as you can see in the image, from San Francisco to Los Angeles we have two edges, and similarly from Los Angeles to Chicago there are two edges.
So basically, in a directed multigraph, the first thing is that it is a directed graph, so it should have a direction attached to the edges; and then, talking about the multigraph part, between a source vertex and a destination vertex there can be two or more edges. The ability to support parallel edges basically simplifies modeling scenarios where there can be multiple relationships between the same vertices. For example, let's say these are two persons: they can be friends as well as co-workers, right? These kinds of scenarios can be easily modeled using a directed multigraph. Now, each vertex is keyed by a unique 64-bit long identifier, which is basically the vertex ID, and it helps in indexing. So each of your vertices contains a vertex ID, which is a unique 64-bit long identifier, and similarly, edges have corresponding source and destination vertex identifiers; so this edge would have this vertex identifier as well as that vertex identifier, or you can say a source vertex ID and a destination vertex ID.
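Because a multigraph allows parallel edges, the edge collection must be a list of (source, destination, label) triples rather than a set of pairs. A minimal Python sketch of the idea (the city names follow the slide; the labels are illustrative assumptions):

```python
# A directed multigraph: parallel edges between the same vertex pair are kept.
edges = [
    ("San Francisco", "Los Angeles", "route A"),
    ("San Francisco", "Los Angeles", "route B"),   # parallel edge, same endpoints
    ("Los Angeles", "Chicago", "friend"),
    ("Los Angeles", "Chicago", "co-worker"),       # two relationships, one pair
]

# Count the parallel edges between one source/destination pair.
parallel = [e for e in edges if (e[0], e[1]) == ("San Francisco", "Los Angeles")]
print(len(parallel))  # 2
```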
So as we discussed, this property graph is basically parameterized over the vertex and edge types, and these are the types of objects associated with each vertex and edge. GraphX basically optimizes the representation of vertex and edge types, and it reduces the in-memory footprint by storing primitive data types in specialized arrays. In some cases, it might be desirable to have vertices with different property types in the same graph; this can be accomplished through inheritance. So, for example, to model users and products in a bipartite graph, we have a user property and a product property. Okay, so let me first tell you what a bipartite graph is. A bipartite graph, also called a bigraph, is a graph whose set of vertices is
decomposed into two disjoint sets such that no two vertices within the same set are adjacent. So as you can see over here, we have user properties and product properties, but no two user vertices can be adjacent, or you can say there should be no edge joining any two user vertices, and similarly there should be no edge joining two product vertices. So in this scenario we use inheritance. As you can see here, we have the class VertexProperty; now, basically what we are doing is creating another class, UserProperty, and here we have name, which is a string, and we are extending, or you can say inheriting, the VertexProperty class. Again, in the case of ProductProperty, we have name, that is, the name of the product, which is again a string, and then we have the price of the product, which is a double, and we are again extending this VertexProperty class. And at last, we are creating a graph with this VertexProperty and then String. So this is how we can basically model users and products as a bipartite graph: we have created the UserProperty as well as the ProductProperty, and both are extending the VertexProperty class.
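The inheritance trick just described can be mirrored outside Scala as well. Here is a hedged Python analogue of the UserProperty/ProductProperty example (class and field names follow the transcript's Scala sketch, not any GraphX API):

```python
# A common base class lets one graph hold vertices of different property types.
class VertexProperty:
    pass

class UserProperty(VertexProperty):
    def __init__(self, name):
        self.name = name

class ProductProperty(VertexProperty):
    def __init__(self, name, price):
        self.name = name
        self.price = price

# One vertex map mixes both property types, as in a bipartite user/product graph.
vertices = {1: UserProperty("alice"), 2: ProductProperty("laptop", 999.0)}
print(all(isinstance(p, VertexProperty) for p in vertices.values()))  # True
```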
Now, talking about this property graph: it's similar to your RDD. Like your RDD, property graphs are immutable, distributed and fault-tolerant. Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes, and substantial parts of the original graph, which can be the structure of the graph or its attributes or indices, are reused in the new graph, reducing the cost of this inherently functional data structure. So basically, with your property graph, once you try to change its values or structure, it creates a new graph with the changed structure or values, and the substantial parts of the original graph are reused to improve performance; it can be the structure of the graph which is getting reused, or the attributes or indices of the graph. This is how your property graph provides efficiency. Now, the graph is partitioned across the executors using a range of vertex partitioning heuristics, and, similar to RDDs, each partition of the graph can be recreated on a different machine in the event of a failure.
So this is how your property graph provides fault tolerance. As we already discussed, logically the property graph corresponds to a pair of typed collections, including the properties for each vertex and edge, and as a consequence the Graph class contains members to access the vertices and the edges. So as you can see, we have the Graph class, and then we have vertices and edges. Now, this VertexRDD is extending your RDD of (vertex ID, vertex property) pairs, and similarly your EdgeRDD is extending your RDD of edge properties. So the classes VertexRDD and EdgeRDD extend optimized versions of the RDD containing your vertex ID and vertex property, and the RDD containing your edge property, and both VertexRDD and EdgeRDD provide additional functionality built on top of graph computation and leverage internal optimizations as well.
So this is the reason we use this VertexRDD or EdgeRDD: each already extends the RDD containing your vertex ID and vertex property, or your edge property, it provides you additional functionality built on top of graph computation, and it also gives you some internal optimizations. Now, let me clear this, and let's take an example of a property graph where the vertex property might contain the username and occupation. So as you can see in this table, we have the ID of each vertex and then the property attached to it, that is, the username as well as the occupation, or you can say the profession, of the user, and we can annotate the edges with strings describing the relationships between the users. So as you can see, first is Thomas, who is a professor; second is Frank, who is also a professor; third is Jenny, who is a student; and fourth is Bob, who is a doctor. Now, Thomas is a colleague of Frank; then you can see that Thomas is the academic advisor of Jenny; again, Frank is also an academic advisor of Jenny; and then the doctor is the health advisor of Jenny.
So the resulting graph would have a signature something like this; I'll explain it in a while. There are numerous ways to construct a property graph from raw files, RDDs or even synthetic generators, and we'll discuss them under graph builders, but probably the most general method is to use the Graph object. So let's take a look at the code first. First, over here, we are assuming that the SparkContext has already been constructed, and we refer to it as sc. Next, we are creating an RDD for the vertices. So as you can see, for users we have specified an RDD of the vertex ID and then two strings: the first one would be your username and the second one would be your profession. Then we are using sc.parallelize and creating an array where we specify all the vertices: 1L with the name Thomas and the profession professor; similarly 2L, Frank, professor; then 3L, Jenny, student; and 4L, Bob, doctor. So here we have created the vertices. Next, we are creating an RDD for the edges. First we are giving it the name relationships; then we are creating an RDD of Edge[String], and then we are using sc.parallelize to create the edges, and in the array we are specifying the source vertex, then we are specifying the destination vertex.
And then we are giving the relation, that is, colleague. Similarly, for the next edge the source is 1, the destination is 3, and the relation is academic advisor, and so it goes on. Then in this line we are defining a default user in case there is a relationship with a missing user; we have given the name as default user and the profession as missing. Next, we are trying to build the initial graph, and for that we are using this Graph object: we have specified users, that is, your vertices, then we are specifying relationships, that is, your edges, and then we are giving the default user, which is basically for any missing user. Now, as you can see over here, we are using the Edge case class, and edges have a source ID and a destination ID, which correspond to your source and destination vertices. In addition, the Edge class has an attribute member which stores the edge property, which over here is the relation, that is, colleague, academic advisor, health advisor and so on. So I hope that you guys are clear about creating a property graph: how to specify the vertices, how to specify the edges, and then how to create the graph. Now, we can deconstruct a graph into its respective vertex and edge views by using the graph.vertices and graph.edges members.
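The Scala construction just walked through (vertices, edges, and a default for missing users) can be mirrored in plain Python to make the data flow concrete. This is only a sketch of the same shape, not Spark code; the names and relations come from the example above:

```python
# Mirror of the GraphX example: (id, (name, occupation)) vertices and
# (srcId, dstId, relation) edges.
users = [
    (1, ("Thomas", "professor")),
    (2, ("Frank", "professor")),
    (3, ("Jenny", "student")),
    (4, ("Bob", "doctor")),
]
relationships = [
    (1, 2, "colleague"),
    (1, 3, "academic advisor"),
    (2, 3, "academic advisor"),
    (4, 3, "health advisor"),
]
default_user = ("default user", "missing")

# Any vertex referenced by an edge but absent from `users` gets the default,
# just like the defaultUser argument to Graph().
vertex_map = dict(users)
for src, dst, _ in relationships:
    for v in (src, dst):
        vertex_map.setdefault(v, default_user)

print(vertex_map[1][0])  # Thomas
```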
So as you can see, we are using graph.vertices over here and graph.edges over here. Now, what are we trying to do? First, on the graph which we created earlier, we have graph.vertices.filter. Using the case class, we have the vertex ID, we have the name and then we have the position, and we are specifying the position as doctor. So first we are trying to filter the users whose profession is doctor, and then we are trying to count them. Next, we are specifying graph.edges.filter, and we are basically trying to filter the edges where the source ID is greater than the destination ID, and then we are trying to count those edges. We are using a Scala case expression, as you can see, to deconstruct the tuple, or you can say to deconstruct the result. On the other hand, graph.edges returns an EdgeRDD containing Edge[String] objects.
So we could also have used the case class type constructor, as you can see here. Again, over here we are using graph.edges.filter, and here we have given case Edge, and then we are specifying the source, the destination and then the property of the edge which is attached; then we are filtering and counting. So this is how, using the Edge case class, either with edges or with vertices, you can go ahead and deconstruct them, because your graph.vertices or graph.edges returns a VertexRDD or an EdgeRDD, and to deconstruct them we basically use these case classes. So I hope you guys are clear about transforming a property graph and how to use these case classes to deconstruct the VertexRDD or EdgeRDD. Now, let's quickly move ahead. In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view. Now, you might be wondering what a triplet view is.
So the triplet view logically joins the vertex and edge properties, yielding an RDD of edge triplets parameterized by your vertex property and your edge property. So as you can see, it gives an RDD of edge triplets which have the vertex property as well as the edge property associated with them, and it contains instances of the EdgeTriplet class. Now, this join can be expressed in SQL: we are trying to select the source ID, the destination ID, the source attribute, then the edge attribute, and at last the destination attribute. Your edges table has the alias e, your vertices table has the alias src, and again your vertices table has the alias dst. So we are selecting the source ID, destination ID, source attribute and destination attribute, plus the edge attribute, and we are performing a left join where the edge source ID equals the source vertex ID and the edge destination ID equals the destination vertex ID. Now, your EdgeTriplet class basically extends your Edge class by adding srcAttr and dstAttr members, which contain the source and destination properties, and we can use the triplet view of a graph to render a collection of strings describing the relationships between users.
This is vertex 1, which is denoting your user one, that is, Thomas, who is a professor, and this is vertex 3, which is denoting Jenny, who is a student, and this is your edge, which is defining the relationship between them. So this is an edge triplet, which holds both vertices as well as the edge that denotes the relation between them. Now, looking at this code: first we have already created the graph; then we are taking this graph, finding the triplets, and mapping over each triplet. We are trying to pick out the triplet's source attribute, from which we take the username; then over here we are picking up the triplet attribute, which is nothing but the edge attribute, for example academic advisor; then we are picking up the triplet's destination attribute, from which we again take the username of the destination, which is the username of vertex 3. So, for example, in this situation it will print 'Thomas is the academic advisor of Jenny'.
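What the triplet view computes, a join of each edge with its endpoint properties, can be sketched in a few lines of Python. This is only an analogue of the GraphX operation, using the users and relations from the example above:

```python
# Sketch of GraphX's triplet view: join each edge with the properties of its
# source and destination vertices, then render a descriptive sentence.
vertices = {1: "Thomas", 2: "Frank", 3: "Jenny", 4: "Bob"}
edges = [(1, 3, "academic advisor"),
         (2, 3, "academic advisor"),
         (4, 3, "health advisor")]

triplets = [(vertices[src], rel, vertices[dst]) for src, dst, rel in edges]
facts = [f"{s} is the {r} of {d}" for s, r, d in triplets]
for fact in facts:
    print(fact)
# First line printed: Thomas is the academic advisor of Jenny
```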
Then we are collecting these facts, and using this foreach we are printing each of the triplets that is present in this graph. So I hope that you guys are clear with the concept of triplets. Now, let's quickly take a look at graph builders. As I already told you, GraphX provides several ways of building a graph from a collection of vertices and edges, whether stored in an RDD or on disk. In this Graph object, first we have the apply method. Basically, this apply method allows creating a graph from RDDs of vertices and edges; duplicate vertices are picked arbitrarily, and vertices which are found in the edge RDD but not present in the vertex RDD are assigned a default attribute. So in this apply method, first we are providing the vertex RDD, then we are providing the edge RDD, and then we are providing the default vertex attribute. So it will create the vertices which we have specified.
Then it will create the edges which are specified, and if there is a vertex which is referred to by an edge but is not present in the vertex RDD, then what it does is create that vertex and assign it the value of the default vertex attribute. Next, we have fromEdges. So Graph.fromEdges allows creating a graph only from an RDD of edges, which automatically creates any vertices mentioned by the edges and assigns them the default value. So what happens over here: you provide the edge RDD, and all the vertices that are present in the edge RDD are automatically created, and the default value is assigned to each of those vertices. Then Graph.fromEdgeTuples basically allows creating a graph from only an RDD of edge tuples; it assigns the edges the value 1, and again the vertices which are specified by the edges are automatically created, and the default value which we are specifying over here will be allocated to them. Now, fromEdgeTuples also supports deduplicating edges, which means you can remove the duplicate edges, but for that you have to provide a partition strategy in the uniqueEdges parameter, as it is necessary to co-locate identical edges on the same partition before duplicate edges can be removed.
Moving ahead, none of the graph builders repartition the graph's edges by default; instead, edges are left in their default partitions. As you can see, we have the GraphLoader object, which is basically used to load graphs from the file system. Graph.groupEdges requires the graph to be repartitioned, because it assumes that identical edges will be co-located on the same partition, so you must call Graph.partitionBy before calling groupEdges. So now you can see the edgeListFile method over here, which provides a way to load a graph from a list of edges present on disk. It parses the adjacency list, that is, the source vertex ID and destination vertex ID pairs, and it creates a graph. Now, for example, let's say we have '2 1', which is one edge, then we have '4 1', which is another edge, and then we have '1 2', which is another edge.
So it will load these edges and then it will create the graph: it will create vertex 2, then it will create vertex 4, and then vertex 1; for '2 1' it will create an edge, then for '4 1' it will create an edge, and at last it will create an edge for '1 2'. So it creates a graph something like this: a graph from the specified edges, where the vertices mentioned by the edges are created automatically, and all the vertex and edge attributes are set to the default of 1. So this vertex will be (4, 1), then again this one will be (1, 1), and similarly this vertex will be (2, 1). Now, let's go back to the code. Then we have this canonicalOrientation argument, which allows reorienting edges in the positive direction, that is, from the lower source ID to the higher destination ID, which is basically required by the connected components algorithm; we'll talk about this algorithm in a while, guys. Basically, it helps in reorienting your edges, which means your source vertex ID should always be less than your destination vertex ID.
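The behavior of edgeListFile described above, parsing "src dst" pairs, creating every mentioned vertex, and defaulting all attributes to 1, can be sketched in Python. This is an illustration of the parsing logic only, not the GraphX implementation:

```python
# Sketch of what an edge-list loader does: parse "src dst" pairs from text,
# create every vertex mentioned, and default all vertex/edge attributes to 1.
edge_list_text = "2 1\n4 1\n1 2\n"

edges, vertices = [], {}
for line in edge_list_text.strip().splitlines():
    src, dst = map(int, line.split())
    edges.append((src, dst, 1))   # default edge attribute 1
    vertices[src] = 1             # default vertex attribute 1
    vertices[dst] = 1

print(sorted(vertices))  # [1, 2, 4]
```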
So in that situation it might reorient this edge, basically reversing its direction; similarly, over here, the edge which is going from 2 to 1 will be reoriented and its direction reversed. Now, talking about the minEdgePartitions argument: this basically specifies the minimum number of edge partitions to generate. There might be more edge partitions than specified; let's say the HDFS file has more blocks, then obviously more partitions will be created, but this gives you the minimum number of edge partitions that should be created. So I hope that you guys are clear with this GraphLoader: how it works, how you can go ahead and provide the edge list file and how it will create the graph from this edge list file, then this canonicalOrientation, with which we can go ahead and reorient the edges, and then minEdgePartitions, which gives the minimum number of edge partitions that should be created. So now I guess you guys are clear with the graph builders.
So now you know how to use the Graph object and how to create a graph using the apply, fromEdges and fromEdgeTuples methods, and I guess you are also clear with the GraphLoader object, where you can go ahead and create a graph from an edge list. Now let's move ahead and talk about VertexRDD and EdgeRDD. As I already told you, GraphX exposes RDD views of the vertices and edges stored within the graph. However, because GraphX maintains the vertices and edges in optimized data structures, these data structures provide additional functionality as well. Now, let us see some of the additional functionality they provide. Let's first talk about VertexRDD. As I already told you, VertexRDD[A] basically extends RDD[(VertexId, A)] and adds the additional constraint that each vertex ID occurs only once. Moreover, VertexRDD[A] represents a set of vertices, each with an attribute of type A. Internally, this is achieved by storing the vertex attributes in a reusable hash-map data structure.
So suppose this is our hash-map data structure, and suppose two VertexRDDs are derived from the same base VertexRDD, as these two are derived from this one. Then they can be joined in constant time without hash evaluations: you don't have to go ahead and re-evaluate the hashes of both vertex sets, you can easily join them without that cost, and this is one of the ways in which VertexRDD provides optimization. Now, to leverage this indexed data structure, the VertexRDD exposes multiple additional functions, as you can see here: filter, mapValues, then minus, diff, leftJoin, innerJoin and the aggregateUsingIndex function. So let us first discuss these functions. The filter function filters the vertex set but preserves the internal index; based on some condition, it filters the vertices that are present. Then mapValues is basically used to transform the values without changing the IDs, which again preserves your internal index.
So it does not change the IDs of the vertices, and it helps in transforming their values. Now talking about the minus method: it returns what is unique in this set based on the vertex IDs. So if you provide two sets of vertices, the first containing V1, V2 and V3 and the second containing V3, it will return V1 and V2, because they are unique across both sets, and this is done with the help of the vertex IDs. Next we have the diff function, which basically removes the vertices from this set that appear in the other set. Then we have leftJoin and innerJoin. These join operators take advantage of the internal indexing to accelerate joins, so you can go ahead and perform a leftJoin or an innerJoin. Next you have aggregateUsingIndex. Basically, aggregateUsingIndex is nothing but reduceByKey, but it uses the index on this RDD to accelerate the reduceByKey operation. Again, filter is actually implemented using a BitSet, thereby reusing the index and preserving the ability to do fast joins with other VertexRDDs. Similarly, the mapValues operator as well
does not allow the map function to change the vertex ID, and this again helps in reusing the same hash-map data structure. Both leftJoin and innerJoin are able to identify whether the two VertexRDDs being joined are derived from the same hash map, and in that case they can use a linear scan, so they don't have to go for costly point lookups. So this is the benefit of using VertexRDD. To summarize: VertexRDD uses a hash-map data structure which is reusable; it tries to preserve the indexes so that it is easier to derive a new VertexRDD from an existing one; and again, while performing join operations, it is pretty easy to go ahead, perform a linear scan and then join those two VertexRDDs.
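To make the functions above concrete, here is a small sketch. It assumes a Spark environment with a `SparkContext` named `sc`, and the vertex data is made up purely for illustration:

```scala
// Sketch of the extra VertexRDD operations discussed above.
import org.apache.spark.graphx.{VertexId, VertexRDD}

val setA: VertexRDD[Double] = VertexRDD(sc.parallelize(
  Seq((1L, 1.0), (2L, 2.0), (3L, 3.0))))
val setB: VertexRDD[Double] = VertexRDD(sc.parallelize(
  Seq((3L, 30.0), (4L, 40.0))))

val even    = setA.filter { case (id, _) => id % 2 == 0 } // preserves the index
val doubled = setA.mapValues(_ * 2.0)                     // IDs stay unchanged
val uniqueA = setA.minus(setB)                            // vertices unique to setA
val joined  = setA.innerJoin(setB) { (id, a, b) => a + b } // fast indexed join
```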
So it actually helps in optimizing your performance. Now moving ahead, let's talk about EdgeRDD. Again, as you can see, EdgeRDD[ED] extends RDD[Edge[ED]]. It organizes the edges in block partitions using one of the various partitioning strategies, which is defined by the partition strategy parameter. Within each partition, the edge attributes and the adjacency structure are stored separately, which enables maximum reuse when changing the attribute values. So basically, while storing your edge attributes and your source and destination vertices, they are stored separately, so that when you change the attribute values, the adjacency part can be reused as many times as needed, and only the attributes themselves change. Now, as you can see, we have three additional functions over here: mapValues, reverse and innerJoin.
So in EdgeRDD, mapValues basically transforms the edge attributes while preserving the structure; it is helpful in transforming, so you can use mapValues to map the values of your EdgeRDD. Then you can go ahead and use this reverse function, which reverses the edges, reusing both attributes and structure, so the source becomes the destination and the destination becomes the source. Now talking about innerJoin: it basically joins two EdgeRDDs partitioned using the same partitioning strategy. As we already discussed, the same partition strategy is required, because to co-locate the edges you need to use the same partition strategy, and identical edges should reside in the same partition to perform a join operation over them. Now, let me quickly give you an idea about the optimization performed in GraphX. GraphX basically adopts a vertex-cut approach to distributed graph partitioning. So suppose you have five vertices and they are connected; let's not worry about the arrows, or the direction, right now.
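The three EdgeRDD operations above can be sketched like this, starting from the edges of an assumed graph `g` of type `Graph[VD, Int]` in a Spark environment:

```scala
// Sketch of EdgeRDD's mapValues, reverse and innerJoin.
import org.apache.spark.graphx.{Edge, EdgeRDD}

val edges: EdgeRDD[Int] = g.edges

val weighted = edges.mapValues(e => e.attr * 2) // transform attrs, keep structure
val reversed = edges.reverse                    // swap src and dst on every edge

// innerJoin expects both EdgeRDDs to use the same partition strategy:
val joined   = edges.innerJoin(weighted) { (src, dst, a, b) => a + b }
```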
So either the graph can be divided along the edges, which is one approach, or it can be divided along the vertices, in which case it would be divided something like this. So rather than splitting graphs along edges, GraphX partitions the graph along vertices, which can reduce both the communication and storage overhead. Logically, what happens is that your edges are assigned to machines, while your vertices are allowed to span multiple machines. So the vertex is basically divided across multiple machines, and each edge is assigned to a single machine. Then the exact method of assigning edges depends on the partition strategy. The partition strategy is the one which decides how to assign the edges to the different machines, or you can say different partitions. Users can choose between different strategies by repartitioning the graph with the Graph.partitionBy operator. Now, as we discussed, this partitionBy operator repartitions and relocates the edges, and we basically try to put the identical edges
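A minimal sketch of choosing a partition strategy, on an assumed graph `graph` in a Spark environment; EdgePartition2D is just one of the built-in strategies:

```scala
// Sketch: repartition the edges with a built-in vertex-cut strategy.
// Other options include RandomVertexCut, CanonicalRandomVertexCut
// and EdgePartition1D.
import org.apache.spark.graphx.PartitionStrategy

val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
```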
on a single partition, so that different operations like joins can be performed on them. Once the edges have been partitioned, the main challenge is efficiently joining the vertex attributes with the edges. Because real-world graphs typically have more edges than vertices, we move vertex attributes to the edges, and because not all partitions will contain edges adjacent to all vertices, we internally maintain a routing table. The routing table is what broadcasts the vertices and then implements the join required for the operations. So I hope that you guys are clear on how VertexRDD and EdgeRDD work, how the optimizations take place, and how the vertex-cut approach optimizes the operations in GraphX. Now, let's talk about graph operators. Just as RDDs have basic operations like map, filter and reduceByKey, property graphs also have a collection of basic operators that take user-defined functions and produce new graphs with transformed properties and structure. The core operators that have optimized implementations are defined in the Graph class, and convenient operators that are expressed as a composition of the core operators are defined in the GraphOps class. However, thanks to Scala implicits, the operators in GraphOps are automatically available as members of the Graph class, so you can use them as members of the Graph class as well.
Now, as you can see, we have a list of operators: property operators, then structural operators, then join operators, and then something called neighborhood operators. So let's talk about them one by one. Talking about property operators: just like RDD has the map operator, the property graph contains the mapVertices, mapEdges and mapTriplets operators. Each of these operators yields a new graph with the vertex or edge properties modified by the user-defined map function. Based on the user-defined map function, it basically transforms or modifies the vertices if it's mapVertices, or it transforms or modifies the edges if it's the mapEdges operator, and the same holds for mapTriplets as well. Now, the important thing to note is that in each case the graph structure is unaffected, and this is a key feature of these operators,
which allows the resulting graph to reuse the structural indices of the original graph. Each and every time you apply a transformation, it creates a new graph while the original graph is unaffected, so its structural indices can be reused in creating the new graphs. Now, talking about mapVertices, let me use the highlighter. So first we have mapVertices. It maps, or you can say transforms, the vertices: you provide the vertex ID and the vertex attribute, you apply some transformation function, and it gives you a graph with the new vertex property, as you can see. The same is the case with mapEdges: again, you provide the edges and then you transform the edges. So initially the attribute was ED and then you transform it to ED2, and the graph which is returned is the graph with the changed edge attribute; you can see here the attribute is ED2.
Same is the case with mapTriplets. Using mapTriplets you work with the edge triplet, where you can go ahead and target the vertex properties, or you can say vertex attributes (to be more specific, the source vertex attribute as well as the destination vertex attribute) and the edge attribute, and then you can apply a transformation over those source attributes, destination attributes or edge attributes. So you can change them, and it will again return a graph with the transformed values. Now I guess you guys are clear with the property operators, so let's move on to the next kind, the structural operators. Currently GraphX supports only a simple set of commonly used structural operators, and we expect more to be added in the future. Now you can see, in structural operators we have the reverse operator, then we have the subgraph operator, then we have the mask operator and then we have the groupEdges operator. So let's talk about them one by one. First, the reverse operator: as the name suggests, it returns a new graph with all the edge directions reversed. So basically it will change your source vertex into the destination vertex, and your destination vertex into the source vertex.
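The three property operators can be sketched as below, on an assumed `Graph[String, Int]` named `graph` (string vertex attributes, integer edge weights) in a Spark environment:

```scala
// Sketch of mapVertices, mapEdges and mapTriplets.
val upperVerts = graph.mapVertices { (id, name) => name.toUpperCase } // new vertex attrs
val doubledEs  = graph.mapEdges(e => e.attr * 2)                      // new edge attrs

// mapTriplets can see both endpoint attributes while transforming the edge:
val labelled = graph.mapTriplets(t => s"${t.srcAttr}->${t.dstAttr}:${t.attr}")

// In every case the graph structure (and its structural indices) is reused.
```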
So it will reverse the direction of your edges. The reverse operation does not modify vertex or edge properties or change the number of edges, so it can be implemented efficiently without data movement or duplication. Next we have the subgraph operator. The subgraph operator takes vertex and edge predicates (you can say vertex and edge conditions) and returns the graph containing only the vertices that satisfy the vertex predicate and the edges that satisfy the edge predicate. So basically we give a condition about edges and vertices, and only those vertices and edges which fulfill the predicates will be returned. Now, the subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest and eliminate the rest of the components. So you can see, this is the edge predicate and this is the vertex predicate: we are providing the edge triplet with the vertex and edge attributes and expecting a Boolean value, and the same is the case with the vertex predicate, where we're providing the vertex ID and the vertex attribute over here.
And then again, it will yield a graph which is a subgraph of the original graph and which fulfills those predicates. Now, the next operator is the mask operator. The mask operator constructs a subgraph by returning a graph that contains only the vertices and edges that are also found in the input graph. Basically, you can treat this mask operator as a comparison between two graphs: suppose we are comparing graph 1 and graph 2; it will return the subgraph which is common to both graphs. Again, this can be used in conjunction with the subgraph operator, basically to restrict a graph based on the properties in another related graph. So I guess you guys are clear with the mask operator: over here we have a graph, we are providing the input graph as well, and it returns a graph which is basically the common subset of both of these graphs.
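Subgraph and mask can be sketched like this, again on an assumed `Graph[String, Int]` named `graph` in a Spark environment:

```scala
// Sketch: keep only vertices with a non-empty name and edges with
// positive weight, then intersect the original graph with the result.
val sub = graph.subgraph(
  epred = triplet => triplet.attr > 0,    // edge predicate on the triplet
  vpred = (id, name) => name.nonEmpty)    // vertex predicate

// mask restricts `graph` to what also appears in `sub`:
val restricted = graph.mask(sub)
```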
Now, the groupEdges operator merges the parallel edges in a multigraph. So what it does is, the duplicate edges between a pair of vertices are merged, or you can say aggregated with some user-defined function, and in many numerical applications the edges can simply be added, combining their weights into a single edge, which will again reduce the size of the graph. For example, you have two vertices V1 and V2, and there are two edges between them with weights 10 and 15. So actually what you can do is merge those two edges, if they have the same direction, and represent the weight as 25. This will actually reduce the size of the graph.
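The weight-merging example above can be sketched as follows, on an assumed graph `graph` in a Spark environment; note that parallel edges must first be co-located in the same partition for groupEdges to see them:

```scala
// Sketch: merge parallel edges by summing their weights, so two edges of
// weight 10 and 15 between the same vertex pair become one edge of weight 25.
import org.apache.spark.graphx.PartitionStrategy

val merged = graph
  .partitionBy(PartitionStrategy.CanonicalRandomVertexCut) // co-locate duplicates
  .groupEdges(merge = (a, b) => a + b)
```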
Now looking at the next operators, which are the join operators: in many cases it is necessary to join data from external collections with graphs. For example, we might have extra user properties that we want to merge with an existing graph, or we might want to pull vertex properties from one graph into another. These are some of the situations where you would go ahead and use the join operators. Now, as you can see over here, the first operator is joinVertices. The joinVertices operator joins the vertices with the input RDD and returns a new graph with the vertex properties obtained after applying the user-defined map function; the vertices without a matching value in the RDD basically retain their original value. Now talking about outerJoinVertices: it behaves similarly to joinVertices, except that the user-defined map function is applied to all the vertices and can change the vertex property type. So suppose you have an old graph which has a vertex attribute of old price, and then you created a new graph from it, which has the vertex attribute of new price.
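These two join operators can be sketched as below, on an assumed `Graph[String, Int]` named `graph` with a hypothetical external RDD of user ages, in a Spark environment with `sc` available:

```scala
// Sketch: attaching external per-vertex data with the join operators.
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

val ages: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 30), (2L, 25)))

// joinVertices: unmatched vertices keep their original attribute, and the
// map function cannot change the attribute type.
val withAge = graph.joinVertices(ages) { (id, name, age) => s"$name ($age)" }

// outerJoinVertices: applied to every vertex, and may change the type.
val maybeAge = graph.outerJoinVertices(ages) { (id, name, ageOpt) =>
  (name, ageOpt.getOrElse(-1))
}
```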
So you can go ahead and join these two graphs, and you can perform an aggregation of both the old and new prices in the new graph. In this kind of situation the join operators are used. Now moving ahead, let's talk about neighborhood aggregation. A key step in many graph analytics tasks is aggregating information about the neighborhood of each vertex. For example, we might want to know the number of followers each user has, or the average age of the followers of each user. Many iterative graph algorithms, like PageRank, shortest path and connected components, repeatedly aggregate the properties of neighboring vertices. Now, GraphX has four operators for neighborhood aggregation. The first one is aggregateMessages: the core aggregation operation in GraphX is aggregateMessages. This operator applies a user-defined send message function, as you can see over here, to each of the edge triplets in the graph, and then it uses the merge message function to aggregate those messages at the destination vertex.
Now, the user-defined send message function takes an EdgeContext, as you can see, which exposes the source and destination attributes along with the edge attribute, and functions like sendToSrc or sendToDst are used to send messages to the source and destination vertices. You can think of the send message as the map function in MapReduce; and the user-defined merge function, which takes two messages destined for the same vertex (you can say the same destination vertex) and combines or aggregates them into a single message, you can think of as the reduce function in MapReduce. Now, the aggregateMessages operator returns a VertexRDD that contains the aggregated messages at each of the destination vertices, and vertices that did not receive a message are not included in the returned VertexRDD. So only those vertices are returned which actually received a message, and those messages have been merged.
Any vertex which hasn't received a message will not be included in the returned VertexRDD. In addition, as you can see, we have a tripletFields parameter. So aggregateMessages takes an optional tripletFields argument, which indicates what data is accessed in the EdgeContext. The possible options for tripletFields are defined in the TripletFields class, and the default value is TripletFields.All, as you can see over here; this indicates that the user-defined send message function may access any of the fields in the EdgeContext. So this tripletFields argument can be used to notify GraphX that only some parts of the EdgeContext will be needed, which allows GraphX to select an optimized join strategy. So I hope that you guys are clear with aggregateMessages. Let's quickly move ahead and look at the second operator, which is the mapReduceTriplets operator. In earlier versions of GraphX, neighborhood aggregation was accomplished using the mapReduceTriplets operator; this operator is used in older versions of GraphX. It takes a user-defined map function, which is applied to each triplet and can yield messages, which are then aggregated using the user-defined reduce function.
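A minimal aggregateMessages sketch, computing each vertex's in-degree on an assumed graph `graph` in a Spark environment; since no attributes are read, tripletFields can be restricted to None:

```scala
// Sketch: count incoming edges per vertex with aggregateMessages.
import org.apache.spark.graphx.{TripletFields, VertexRDD}

val inDegrees: VertexRDD[Int] = graph.aggregateMessages[Int](
  sendMsg  = ctx => ctx.sendToDst(1), // one message per incoming edge
  mergeMsg = (a, b) => a + b,         // sum the messages at each vertex
  TripletFields.None)                 // no src/dst/edge attrs needed
```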
This one is the user-defined map function, and this one is your user-defined reduce function. So it basically applies the map function to all the triplets and then aggregates those messages using the user-defined reduce function. Now, the newer replacement for this mapReduceTriplets operator is aggregateMessages. Moving ahead, let's talk about the degree-information operators. One of the common aggregation tasks is computing the degree of each vertex, that is, the number of edges adjacent to each vertex. In the context of directed graphs, it is often necessary to know the in-degree, the out-degree and the total degree of each vertex. These kinds of things are pretty important, and the GraphOps class contains a collection of operators to compute the degrees of each vertex.
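A quick sketch of these degree operators, on an assumed graph `graph` in a Spark environment:

```scala
// Sketch: per-vertex degree information from GraphOps.
val inDeg  = graph.inDegrees  // VertexRDD[Int]: incoming edges per vertex
val outDeg = graph.outDegrees // VertexRDD[Int]: outgoing edges per vertex
val total  = graph.degrees    // VertexRDD[Int]: in + out per vertex

// e.g. the vertex with the most incoming edges:
val maxIn = inDeg.reduce((a, b) => if (a._2 > b._2) a else b)
```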
So, as you can see, we have maxInDegrees, maxOutDegrees and maxDegrees. maxInDegrees will tell us the maximum number of incoming edges, maxOutDegrees will tell us the maximum number of outgoing edges, and maxDegrees will tell us the maximum number of incoming plus outgoing edges. Now moving ahead to the next operator, that is collecting neighbors: in some cases it may be easier to express the computation by collecting neighboring vertices and their attributes at each vertex. This can be easily accomplished using the collectNeighborIds and collectNeighbors operators. Basically, collectNeighborIds takes the edge direction as a parameter and returns a VertexRDD that contains, for each vertex, the array of vertex IDs neighboring that particular vertex; similarly, collectNeighbors again takes the edge direction as input and returns an array with both the vertex IDs and the vertex attributes. Now, let me quickly open my VM and let us go through the Spark directory first. Let me first open my terminal. I'll first start the Hadoop daemons; for that I will go to the Hadoop home directory and run the start-all.sh script file.
So let me check whether the Hadoop daemons are running or not. As you can see, the NameNode, DataNode, secondary NameNode, NodeManager and ResourceManager are there: all the Hadoop daemons are up. Now I will navigate to the Spark home. Let me first start the Spark daemons. I can see the Spark daemons are running, so I'll first minimize this and take you to the Spark home. This is my Spark directory; I'll go inside. Now, let me first show you the data which is present with Spark by default. We'll open this in a new tab. So you can see we have two files in this graphx data directory. Meanwhile, let me take you to the example code. So this is the examples directory, and inside src/main/scala you can find the graphx directory, and inside this graphx directory you'll find some of the sample codes which are present over here. So I will take you to this AggregateMessagesExample.scala; meanwhile, let me open the data as well, so you'll be able to understand.
Now, this is the followers.txt file. Basically, you can imagine these are the edges, each referring to vertices: this is vertex 2 and this is vertex 1, then this is vertex 4 and this is vertex 1, and similarly so on; these lines are representing those edges. And if you remember, I have already told you that inside the GraphLoader class there is a function called edgeListFile which takes the edges from a file and then constructs the graph based on that. Now, second, you have this users.txt. These are basically the vertices with their vertex IDs: the vertex ID for this vertex is 1, then for this one it is 2, and so on; and then this is the data which is attached, or you can say the attributes of the vertices. So these are the vertex IDs, which are 1, 2, 3 respectively, and this is the data which is associated with each vertex: this is the username, and this might be the name of the user, and so on. Now, you can also see that in some of the cases the name of the user is missing, as in this case. So these are the vertices, or you can say the vertex IDs and vertex attributes.
Now, let me take you through this aggregate messages example. As you can see, we are giving the name of the package, org.apache.spark.examples.graphx; then we are importing org.apache.spark.graphx, from which we import the Graph class as well as VertexRDD. Next we are using this GraphGenerators class; I'll tell you why we are using this graph generator. And then we are using the Spark session over here. So this is an example where we are using the aggregateMessages operator to compute the average age of the more senior followers of each user. Okay. So this is the object of AggregateMessagesExample. Now, this is the main function, where we are first initializing the Spark session, then the name of the application (so you have to provide the name of the application), and this is the getOrCreate method. Next we are initializing the Spark context as sc. Now, coming to the code.
So we are specifying a graph, and this graph contains Double and Int. Now, I just told you that we are importing GraphGenerators: this graph generator is used to generate a random graph for simplicity, so you would have some number of edges and vertices. Then you are using this logNormalGraph method: you're passing the Spark context and you're specifying the number of vertices as 100, so it will generate 100 vertices for you. Then what are you doing? You are calling mapVertices and you're trying to map the ID to a Double; so what this will do is basically map each vertex ID to a Double attribute. Then next you're trying to calculate the older followers, where olderFollowers is declared as a VertexRDD of (Int, Double): your VertexRDD has the vertex ID, and the data associated with each vertex (you can say the vertex attribute) is a pair of an Int counter and a Double age. So you have this graph, which is basically generated randomly, and then you are performing aggregateMessages.
So this is the aggregateMessages operator. Now, if you remember, we first have the send message. So inside this triplet we are specifying a function: if the source attribute of the triplet is greater than the destination attribute of the triplet, that is, if the follower's age is greater than the age of the person he is following, then in that situation it will send a message to the destination vertex containing a counter, that is 1, and the age of the source attribute, that is the age of the follower. So you can see: if the age of the destination is less than the age of the source attribute, it tells you that the follower is older than the user. In that situation we'll send 1 to the destination, and we'll send the age of the source, or you can say the age of the follower. Then second,
I have told you that we have the merge message. So here we are adding the counters and the ages in this reduce function. Now what are we doing? We are dividing the total age by the number of older followers to derive the average age of older followers. So this is the reason why we have passed the attribute of the source vertex. First we are specifying this variable, that is averageAgeOfOlderFollowers, and then we are specifying the VertexRDD, so this will be Double; and then this olderFollowers, that is the result which we are picking up from here, and then we are trying to map the values. So in the vertex we have the ID and we have the value, and in this situation we are using a case over count and totalAge. So what we are doing is: we are taking this totalAge and we are dividing it by count, which we have gathered from this send message and then aggregated using this reduce function.
We are again taking the total age of the older followers, and then we are trying to divide it by the count to get the average age. At last we are trying to display the result, and then we are stopping Spark. So let me quickly open the terminal. I will go to examples, so cd examples. I took you through the source directory where the code is present inside scala, and then inside there is a graphx directory where you will find the code; but to execute the example you need to go to the jars directory. Now, this is the Scala examples jar which you need to execute. But before this, let me take you to HDFS. So the URL is localhost:50070, and we'll go to Utilities and then Browse the file system. As you can see, I have created a user directory in which I have specified the username, that is edureka, and inside edureka I have placed my data directory, where we have this graphx directory, and inside graphx
we have both the files, that is followers.txt and users.txt. So in this program we are not referring to these files, but the upcoming examples will be referring to them, so I would request you to first move them to this HDFS directory, so that Spark can refer to the files in data/graphx. Now, let me quickly minimize this. The command to execute is spark-submit; then I'll provide the jar, that is the Spark examples jar. So this is the jar; then I'll specify the class name. To get the class name I will go to the code: I'll first take the package name from here, and then I'll take the class name, which is AggregateMessagesExample; so this is my class. And as I told you, you have to provide the name of the application, so let me keep it as "example", and I'll hit enter. So now you can see the result.
So this is the followers column and this is the average age of followers: so it is 34. Then we have 52, which is the count of senior followers, and the average age is 76.8; then this one has 96 senior followers, and the average age of the followers is 99.0; then it has 4 senior followers and the average age is 51; then this vertex has 16 senior followers with an average age of 57.5, and so on; you can see the result over here. So I hope now you guys are clear with aggregateMessages: how to use aggregateMessages, how to specify the send message, and how to write the merge message. So let's quickly go back to the presentation. Now, let us quickly move ahead and look at some of the graph algorithms. The first one is PageRank. PageRank measures the importance of each vertex in a graph, assuming that an edge from u to v represents
an endorsement of v's importance by u. For example, if a Twitter user is followed by many other users, that user will obviously rank high. GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank object. Static PageRank runs a fixed number of iterations, which can be specified by you, while dynamic PageRank runs until the ranks converge; what we mean by that is, they stop changing by more than a specified tolerance. So it runs until it has optimized the PageRank of each of the vertices. Now, the GraphOps class allows calling these algorithms directly as methods on the Graph class. Now, let's quickly go back to the VM. So this is the PageRank example; let me open this file. So first we are specifying this graphx package, then we are importing the GraphLoader.
So, as you can remember, inside this GraphLoader class we have that edgeListFile method, which will basically create the graph using the edges, and we have those edges in our followers.txt file. Now, coming back to the PageRank example: we're importing the Spark SQL SparkSession. Now, this is the PageRankExample object, inside which we have created a main function, and we have similarly created the Spark session, then the builder, and we're specifying the app name, which is to be provided; then we have the getOrCreate method. So this is where we are initializing the Spark context. As you can remember, I told you that using this edgeListFile method we are basically creating the graph from the followers.txt file. Now we are running PageRank over here: ranks will give you the PageRank of all the vertices inside this graph, which we have just created using the GraphLoader class. So if you're passing an integer as an argument to pageRank, it will run that number of iterations; otherwise, if you're passing a double value, it will run until convergence.
So we are running pageRank on this graph and taking the vertices. Now after this, we are trying to load the users.txt file, and we're splitting each line by comma; then field 0 is converted to Long and we are keeping field 1. Basically, field 0 in your users.txt is the vertex ID, or you can say the ID of the user, and field 1 is the username. So we are loading these two fields. Now we are trying to rank by username: we are taking the users and we are joining the ranks, so this is where we are using the join operation. With ranksByUsername we are trying to attach those usernames to the PageRank values: we are taking the users, then we are joining the ranks, which again we are getting from pageRank, and then we are mapping the ID, username and rank. So it will run some iterations over the graph and will try to converge.
So after converging you can see the user and the rank. So the maximum rank is with Barack Obama, which is 1.45, then with Lady Gaga it's 1.39, and then with odersky, and so on. Let's go back to the slide. So now, after PageRank, let's quickly move ahead to connected components. The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. So let us quickly go back to the VM. Now let's go inside the GraphX directory and we'll open this ConnectedComponentsExample. So again, the imports are the same: GraphLoader and SparkSession. Now this is the ConnectedComponentsExample object, next this is the main function, and inside the main function we are again specifying the Spark session, then the app name, then we have the Spark context. So it's similar. So again, using this GraphLoader class and this edgeListFile method we are loading the followers.txt file. Now, on this graph,
we are using this connected components algorithm, and then we are trying to find the connected components. Now at last we are trying to again load this user file, that is users.txt, and we are trying to join the connected components with the usernames. So over here it is the same thing which we discussed in PageRank: we are taking field zero and field one of your users.txt file, and at last we are joining these users with the connected components, that is, from here. Now we are printing ccByUsername with collect. So let us quickly go ahead and execute this example as well. So let me first copy this object name, and let's name this as example two. So as you can see, justinbieber has connected component 1, then you can see this one has connected component 3, then this has connected component 1, then Barack Obama has connected component 1, and so on.
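The "label each component with its lowest-numbered vertex" behaviour can be sketched in a few lines of plain Scala. This is an illustration of the idea, not the GraphX Pregel implementation; the edges are the follower pairs from the followers.txt example:

```scala
// Propagate the minimum vertex id across undirected edges until stable.
def connectedComponents(edges: Seq[(Long, Long)]): Map[Long, Long] = {
  val vertices = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
  var label    = vertices.map(v => v -> v).toMap  // start: everyone labels itself
  var changed  = true
  while (changed) {
    changed = false
    for ((a, b) <- edges) {
      val m = math.min(label(a), label(b))
      if (label(a) != m || label(b) != m) {
        label = label + (a -> m) + (b -> m)
        changed = true
      }
    }
  }
  label
}

// followers.txt-style edges: {1,2,4} form one component, {3,6,7} another
val cc = connectedComponents(
  Seq((2L, 1L), (4L, 1L), (1L, 2L), (6L, 3L), (7L, 3L), (7L, 6L), (6L, 7L), (3L, 7L)))
```

Every vertex in the first component gets label 1 and every vertex in the second gets label 3, matching the component ids seen in the output.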
So this basically gives you an idea about the connected components. Now let's quickly move back to the slide and we'll discuss the third algorithm, that is, triangle counting. So basically a vertex is a part of a triangle when it has two adjacent vertices with an edge between them. So it will form a triangle, right? And then that vertex is a part of a triangle. Now GraphX implements a triangle counting algorithm in the TriangleCount object, which determines the number of triangles passing through each vertex, providing a measure of clustering. So we can compute the triangle count of the social network data set from the PageRank section. One more thing to note is that triangle count requires the edges to be in a canonical orientation, that is, your source ID should always be less than your destination ID, and the graph should be partitioned using the Graph.partitionBy method. Now, let's quickly go back.
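The canonical-orientation requirement (source id less than destination id) can be seen in a small plain-Scala triangle counter. Again, this sketches the idea rather than the GraphX TriangleCount object:

```scala
// Count, per vertex, the triangles it participates in.
def triangleCounts(rawEdges: Seq[(Int, Int)]): Map[Int, Int] = {
  // canonical orientation: smaller id first; drop self-loops and duplicates
  val edges = rawEdges.collect {
    case (a, b) if a != b => (math.min(a, b), math.max(a, b))
  }.distinct
  val edgeSet  = edges.toSet
  val vertices = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
  // a triangle a < b < c needs edges (a,b), (a,c) and (b,c)
  val triangles = for {
    (a, b) <- edges
    c      <- vertices
    if c > b && edgeSet((a, c)) && edgeSet((b, c))
  } yield Seq(a, b, c)
  val hits = triangles.flatten.groupBy(identity).map { case (v, vs) => v -> vs.size }
  vertices.map(v => v -> hits.getOrElse(v, 0)).toMap
}

// vertices 3, 6, 7 close one triangle; 1, 2 and 4 close none
val tc = triangleCounts(Seq((2, 1), (4, 1), (1, 2), (6, 3), (7, 3), (7, 6), (6, 7), (3, 7)))
```

Normalizing every edge to (min, max) and de-duplicating is exactly what the canonical orientation buys you: each triangle is seen once, instead of once per edge direction.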
So let me open the GraphX directory again, and we'll see the triangle counting example. So again, the imports are the same, and the object is TriangleCountingExample, then the main function is the same as well. Now we are again using this GraphLoader class and we are loading followers.txt, which contains the edges. As you can see here, we are using this partitionBy argument and we are passing RandomVertexCut, which is the partition strategy. So this is how you can go ahead and implement a partition strategy. It is loading the edges in canonical order and partitioning the graph for triangle count. Now we are trying to find out the triangle count for each vertex. So we have this triCounts variable, then we are using this triangleCount algorithm, and then we are specifying the vertices. So it will execute triangle count over this graph, which we have just loaded from the followers.txt file.
And again, we are basically joining the usernames. So first we are fetching the usernames, and again here we are performing the join between users and triCounts. So triCounts is from here, and then we are again printing the value from here. So again, this is the same. Let us quickly go ahead and execute this triangle counting example. So let me copy this, I'll go back to the terminal, I'll name it as example three and change the class name, and I hit Enter. So now you can see the triangle count associated with justinbieber is 0, then with Barack Obama it's 1, with odersky it's 1 and with jeresig it's 1. So for better understanding I would recommend you to go ahead and take this followers.txt and create the graph by yourself, and then you can attach these usernames to the vertices, and then you will get an idea about why it is giving the number as 1 or 0.
So again, the subgraph which is connecting vertices 1, 2 and 4 is disconnected from the rest and is not completing any triangles, so the value for these three is 0, and next, the second subgraph, which is connecting your vertices 3, 6 and 7, is completing one triangle. So this is the reason why these three vertices have the value 1. Now let me quickly go back. So now I hope that you guys are clear with all the concepts of graph operators and graph algorithms. So now is the right time to look at a Spark GraphX demo, where we'll go ahead and try to analyze the Ford GoBike data. So let me quickly go back to my VM. So let me first show you the website where you can go ahead and download the Ford GoBike data. So over here you can go to download the Ford GoBike trip history data.
So you can go ahead and download this 2017 Ford GoBike trip data. So I have already downloaded it, and to avoid typos I have already written all the commands. So first let me go ahead and start the Spark shell. So I'm inside the Spark shell now. Let me first import GraphX and Spark RDD. So I've successfully imported GraphX and Spark RDD. Now let me create a Spark SQL context as well. So I have successfully created the Spark SQL context. So this is basically for running SQL queries over the data frames. Now let me go ahead and import the data. So I'm loading the data into a data frame. So the format of the file is CSV; then, in the options, the header is already present, so that's why it's true; then it will automatically infer the schema; and then in the load parameter I have specified the path of the file. So I'll quickly hit Enter. So the data is loaded in the data frame. To check, I'll use df.count, so it will give me the count, and you can see it has around 5 lakh 19 thousand rows.
Now let me quickly go back and I'll print the schema. So this is the schema: the duration in seconds, then we have the start time and end time, then you have the start station ID, then the start station name, then the start station latitude and longitude, then the end station ID and end station name, then the end station latitude and longitude, then your bike ID, the user type, then the birth year of the member and the gender of the member. Now I'm trying to create a data frame, that is, justStations, so it will only contain the station ID and station name, which I'll be using as vertices. So here I am creating a data frame with the name justStations, where I am just selecting the start station ID and casting it as float, and then I'm selecting the start station name, and then I'm using the distinct function to only keep the unique values.
So I'll quickly go ahead and hit Enter. So again, let me take this justStations and print the schema. So you can see there is the station ID, and then there is the start station name. It contains the unique values of the stations in this justStations data frame. So now again I am taking the stations, where I'm selecting the start station ID and end station ID, then I am using distinct, which will again give me the unique values, and I'm using this flatMap where I am specifying the iterables: we are taking x(0), that is, your start station ID, and I am taking x(1), which is your end station ID, and then again I'm applying this distinct function so that it will keep only the unique values, and then at last we have the toDF function, which will convert it to a data frame. So let me quickly go ahead and execute this. So I am printing the schema, and as you can see it has one column, that is, value, and it has data type long.
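The flatMap-plus-distinct trick is easy to see on plain Scala collections; the trip pairs below are made-up station ids, not rows from the real CSV:

```scala
// Each trip contributes its start and end station id; distinct keeps
// every station exactly once, just like the stations data frame.
val trips      = Seq((15, 30), (30, 6), (15, 6), (6, 15))
val stationIds = trips.flatMap { case (start, end) => Seq(start, end) }.distinct
// stationIds: Seq(15, 30, 6)
```

flatMap flattens every (start, end) pair into a single stream of ids, and distinct collapses the repeats, which is why the resulting column holds each station exactly once.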
So I have taken all the start and end station IDs, and using this flatMap I have iterated over all the start and end station IDs, then, using the distinct function, I am taking the unique values and converting them to a data frame. So I can use the stations, and using the stations I will basically keep each of the stations in a vertex. So this is the reason why I'm taking the stations, or you can say I am taking the unique stations from the start station ID and end station ID, so that I can go ahead and define the vertices as the stations. So now we are creating our set of vertices and attaching a bit of metadata to each one of them, which in our case is the name of the station. So as you can see, we are creating this stationVertices, which is again an RDD with VertexId and String. So we are using the stations which we have just created.
We are joining it with justStations, where the station value should be equal to justStations' station ID. So, as we have created stations and justStations, we are joining them, then selecting the station ID and start station name, and then we are mapping row(0) and row(1). So your row(0) will basically be your vertex ID, and row(1) will be the String, that is, the name of your station. So let me quickly go ahead and execute this. So let us quickly print this using collect.foreach(println). So over here we are basically attaching the edges, or you can say we are creating the trip edges for all our individual rides, and then we'll take the station values and add a dummy value of one. So as you can see, I am selecting the start station and end station from df, which is the first data frame we loaded, and then I am mapping it to row(0) and row(1), which are your source and destination, and then I'm attaching the value one to each one of them. So I'll hit Enter.
Now, let me quickly go ahead and print this stationEdges. So it is taking the source vertex ID and the destination vertex ID, or you can say the source station ID and the destination station ID, and it is attaching the value one to each one of them. So now you can go ahead and build your graph. But again, as we discussed, we need a default station, because you can have some situations where your edges might be indicating some vertices, but those vertices might not be present in your vertex RDD. So for that situation we need to create a default station, and I created the default station as a missing station. So now we are all set; we can go ahead and create the graph. So the name of the graph is stationGraph, then the vertices are stationVertices, which we have created and which basically contains the station ID and station name, then we have stationEdges, and at last we have the default station.
So let me quickly go ahead and execute this. So now I need to cache this graph for faster access, so I'll use the cache function. So let us quickly go ahead and check the number of vertices. So these are the number of vertices. Again, we can check the number of edges as well. So these are the number of edges. And to get a sanity check, let's go ahead and check the number of records that are present in the data frame. So as you can see, the number of edges in our graph and the count in our data frame are similar, or you can say the same. So now let's go ahead and run PageRank on our data. We can either run a set number of iterations or we can run it until convergence. In my case, I'll run it till convergence.
So it's ranks, then stationGraph, then pageRank. I have specified a double value, so it will run till convergence. So let's wait for some time. So now that we have executed the PageRank algorithm, we got the ranks which are attached to each of the vertices. So now let us quickly go ahead and look at the ranks. So we are joining the ranks with stationVertices, then we are sorting them in descending order, and we are taking the first 10 rows and printing them. So let's quickly go ahead and hit Enter. So you can see these are the top 10 stations which have the highest PageRank values, so you can say they have more incoming trips. Now one question would be: what are the most common destinations in the data set, from location to location? We can do this by performing a grouping operator and adding the edge counts together. So basically this will give a new graph, except each edge will now be the sum of all the semantically same edges.
So again, we are taking the stationGraph and we are performing groupEdges, where we are taking edge one and edge two and aggregating them, so we are basically summing the semantically same edges. Then we are using triplets, and then we are sorting them in descending order again, and then we are printing, from the source vertex, the number of trips, and then we are taking the destination attribute, or you can say the destination vertex, or you can say the destination station. So you can see there are 1,933 trips from the San Francisco Ferry Building to this station, then again you can see there are 1,411 trips from San Francisco to this location, then there are 1,025 trips from this station to San Francisco, and it goes on. So now we have got a directed graph, which means our trips are directional, from one location to another. So now we can go ahead and find the number of trips that went into a specific station and the number that leave from a specific station. So basically we are trying to find the inbound and outbound values, or you can say we are trying to find the in-degree and out-degree of the stations.
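The grouping step — collapse all edges with the same source and destination and sum their counts — looks like this in plain Scala (made-up station ids; in the demo the same thing is done with groupEdges on the graph):

```scala
// Every trip is an edge (src, dst, 1); grouping by (src, dst) and summing
// the dummy 1s yields the trip count per directed station pair.
val tripEdges = Seq((10, 20, 1), (10, 20, 1), (20, 10, 1), (10, 30, 1))
val grouped = tripEdges
  .groupBy { case (src, dst, _) => (src, dst) }
  .map { case ((src, dst), es) => (src, dst, es.map(_._3).sum) }
  .toSet
// direction matters: the two (10, 20) trips collapse to one edge with
// count 2, but (20, 10) stays a separate edge
```

This is why attaching a dummy weight of one to each ride earlier pays off: summing those ones during the grouping gives the trip count per route for free.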
So let us first calculate the in-degrees using stationGraph, and I am using the inDegrees operator. Then I'm joining it with the stationVertices, then I'm sorting it again in descending order, and then I'm taking the top 10 values. So let's quickly go ahead and hit Enter. So these are the top 10 stations, and you can see the in-degrees. So there are these many trips which are coming into these stations. Now similarly, we can find the out-degree. Now again, you can see the out-degrees as well. So these are the stations and these are the out-degrees. So again, you can go ahead and perform some more operations over this graph. So you can go ahead and find the station which has the most number of incoming trips, that is, the most people coming into that station but fewer people leaving it, and again, on the contrary, you can find out the stations where more trips are leaving but fewer trips are coming in.
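In-degree and out-degree are just counts of incoming and outgoing edges per vertex; here is a plain-Scala sketch with made-up station ids, mirroring what the graph's inDegrees and outDegrees operators compute:

```scala
// inDeg counts trips arriving at a station, outDeg counts trips leaving it.
val edges  = Seq((1, 2), (3, 2), (2, 1), (1, 3))
val inDeg  = edges.groupBy(_._2).map { case (v, es) => v -> es.size }
val outDeg = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
// station 2: two trips in, one trip out
```

Comparing the two maps per station is what lets you spot sinks (many in, few out) and sources (many out, few in).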
But there are fewer trips coming into those stations. So I guess you guys are now clear with Spark GraphX. First we discussed the different types of graphs, then moving ahead we discussed the features of GraphX, then we discussed property graphs: we understood what a property graph is, how you can create vertices, how you can create edges, and how to use VertexRDD and EdgeRDD. Then we looked at some of the important vertex operations, and at last we understood some of the graph algorithms. So I guess now you guys are clear about how to work with Spark GraphX. Today's video is on Hadoop versus Spark. Now, as we know, organizations from different domains are investing in big data analytics today. They're analyzing large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.
All of these findings are helping organizations achieve more effective marketing, new revenue opportunities and better customer service, and they're trying to gain competitive advantages over rival organizations, among other business benefits. Apache Spark and Hadoop are two of the most prominent big data frameworks, and I see people often comparing these two technologies, and that is exactly what we're going to do in this video. Now, we'll compare these two big data frameworks based on different parameters, but first it is important to get an overview of what Hadoop is and what Apache Spark is. So let me just tell you a little bit about Hadoop. Hadoop is a framework to store and process large sets of data across computer clusters, and Hadoop can scale from a single computer system up to thousands of commodity systems that offer local storage and compute power. Hadoop is composed of modules that work together to create the entire Hadoop framework.
These are some of the components that we have in the entire Hadoop framework, or the Hadoop ecosystem. For example, let me tell you about HDFS, which is the storage unit of Hadoop, and YARN, which is for resource management. There are different analytical tools like Apache Hive and Pig, and NoSQL databases like Apache HBase. Even Apache Spark and Apache Storm fit in the Hadoop ecosystem for processing big data in real time. For ingesting data we have tools like Flume and Sqoop: Flume is used to ingest unstructured or semi-structured data, whereas Sqoop is used to ingest structured data into HDFS. If you want to learn more about these tools, you can go to Edureka's YouTube channel and look for the Hadoop tutorial, where everything has been explained in detail. Now, let's move to Spark. Apache Spark is a lightning-fast cluster computing technology that is designed for fast computation.
The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark performs operations similar to Hadoop's modules, but it uses in-memory processing and optimizes the steps. The primary difference between MapReduce in Hadoop and Spark is that MapReduce uses persistent storage while Spark uses resilient distributed data sets, known as RDDs, which reside in memory. Talking about the different components of Spark: Spark Core is the base engine for large-scale parallel and distributed data processing. Further, additional libraries which are built on top of the core allow diverse workloads for streaming, SQL and machine learning. Spark Core is also responsible for memory management and fault recovery, for scheduling, distributing and monitoring jobs on a cluster, and for interacting with the storage systems as well.
Next up, we have Spark Streaming. Spark Streaming is the component of Spark which is used to process real-time streaming data. It enables high-throughput and fault-tolerant stream processing of live data streams. Then we have Spark SQL. Spark SQL is a newer module in Spark which integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive query language. For those of you familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools, where you can extend the boundaries of traditional relational data processing. Next up is GraphX. GraphX is the Spark API for graphs and graph-parallel computation, and thus it extends the Spark resilient distributed data sets with a resilient distributed property graph. Next is Spark MLlib for machine learning. MLlib stands for Machine Learning Library, and Spark MLlib is used to perform machine learning in Apache Spark. Now, since you've got an overview of both these frameworks, I believe that the ground is all set to compare Apache Spark and Hadoop. Let's move ahead and compare Apache Spark with Hadoop on different parameters to understand their strengths. We will be comparing these two frameworks based on these parameters. Let's start with performance first. Spark is fast because it has in-memory processing; it can also use disk for data that doesn't fit into memory. Spark's in-memory processing delivers near real-time analytics, and this makes Spark suitable for credit card processing systems, machine learning, security analysis and processing data from IoT sensors. Now, let's talk about Hadoop's performance. Hadoop was originally designed to continuously gather data from multiple sources, without worrying about the type of data, and to store it across a distributed environment.
And MapReduce uses batch processing; MapReduce was never built for real-time processing. The main idea behind YARN is parallel processing over a distributed data set. The problem with comparing the two is that they have different ways of processing, and the ideas behind their development are also divergent. Next, ease of use. Spark comes with user-friendly APIs for Scala, Java, Python and Spark SQL. Spark SQL is very similar to SQL, so it becomes easier for SQL developers to learn it. Spark also provides an interactive shell for developers, to query, perform other actions and get immediate feedback. Now, let's talk about Hadoop. You can ingest data into Hadoop easily, either by using the shell or by integrating it with multiple tools like Sqoop and Flume, and YARN is just a processing framework that can be integrated with multiple tools like Hive and Pig for analytics. Hive is a data warehousing component which performs reading, writing and managing large data sets in a distributed environment using an SQL-like interface. To conclude here, both of them have their own ways to make themselves user-friendly.
Now, let's come to the cost. Hadoop and Spark are both Apache open-source projects, so there's no cost for the software; cost is only associated with the infrastructure. Both products are designed in such a way that they can run on commodity hardware with a low TCO, or total cost of ownership. Well, now you might be wondering in what ways they are different. Storage and processing in Hadoop is disk-based, and Hadoop uses standard amounts of memory. So with Hadoop we need a lot of disk space as well as faster transfer speeds. Hadoop also requires multiple systems to distribute the disk I/O. But in the case of Apache Spark, due to its in-memory processing it requires a lot of memory, though it can deal with a standard speed and amount of disk. Disk space is a relatively inexpensive commodity, and since Spark does not use disk I/O for processing but instead requires large amounts of RAM to execute everything in memory, Spark systems incur more cost. But yes, one important thing to keep in mind is that Spark's technology reduces the number of required systems: it needs significantly fewer systems that cost more, so there will be a point at which Spark reduces the cost per unit of computation even with the additional RAM requirement.
There are two types of data processing: batch processing and stream processing. Batch processing has been crucial to the big data world. In the simplest terms, batch processing is working with high data volumes collected over a period. In batch processing, data is first collected, then processed, and then the results are produced at a later stage, and batch processing is an efficient way of processing large, static data sets. Generally we perform batch processing on archived data sets, for example calculating the average income of a country or evaluating the change in e-commerce over the last decade. Now, stream processing. Stream processing is the current trend in the big data world; the need of the hour is speed and real-time information, which is what stream processing provides. Batch processing does not allow businesses to quickly react to changing business needs in real time, so stream processing has seen a rapid growth in demand. Now coming back to Apache Spark versus Hadoop: YARN is basically a batch processing framework. When we submit a job to YARN, it reads data from the cluster, performs the operation and writes the results back to the cluster, and then it again reads the updated data, performs the next operation and writes the results back to the cluster, and so on. On the other hand, Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming as well.
Now, let's come to fault tolerance. Hadoop and Spark both provide fault tolerance, but they have different approaches. For HDFS and YARN, both master daemons, that is, the NameNode in HDFS and the ResourceManager in YARN, check the heartbeat of the slave daemons. The slave daemons are the DataNodes and NodeManagers. So if any slave daemon fails, the master daemons reschedule all pending and in-progress operations to another slave. Now, this method is effective, but it can significantly increase the completion times for operations with even a single failure, as Hadoop uses commodity hardware. Another way in which HDFS ensures fault tolerance is by replicating data. Now, let's talk about Spark. As we discussed earlier, RDDs, or resilient distributed data sets, are the building blocks of Apache Spark, and RDDs are the ones which provide fault tolerance to Spark. They can refer to any data set present in an external storage system like HDFS, HBase, a shared file system, etc.
They can also be operated on in parallel. RDDs can persist a data set in memory across operations, which makes future actions up to 10 times faster; if an RDD is lost, it will automatically get recomputed by using the original transformations, and this is how Spark provides fault tolerance. And at the end, let us talk about security. Well, Hadoop has multiple ways of providing security. Hadoop supports Kerberos for authentication, although it is difficult to handle; nevertheless, it also supports third-party vendors like LDAP for authentication, and they also offer encryption. HDFS supports traditional file permissions as well as access control lists, and Hadoop provides service-level authorization, which guarantees that clients have the right permissions for job submission. Spark currently supports authentication via a shared secret. Spark can integrate with HDFS, and it can use HDFS ACLs, or access control lists, and file-level permissions. Spark can also run on YARN, leveraging the capability of Kerberos. Now, this was the comparison of these two frameworks based on the preceding parameters. Now, let us understand the use cases where these technologies fit best. Use cases where Hadoop fits best: for example, when you're analyzing archived data. YARN allows parallel processing over huge amounts of data; parts of the data are processed parallelly and separately on different DataNodes, and the results are gathered from each NodeManager. This works in cases when instant results are not required. Hadoop MapReduce is a good and economical solution for batch processing; however, it is incapable of processing data in real time. Use cases where Spark fits best: real-time big data analysis. Real-time data analysis means processing data that is getting generated by real-time event streams coming in at the rate of billions of events per second. The strength of Spark lies in its ability to support streaming of data along with distributed processing, and Spark claims to process data a hundred times faster than MapReduce, and ten times faster when using disks.
It is also used in graph processing: Spark contains a graph computation library called GraphX which simplifies our life. In-memory computation along with built-in graph support improves the performance of the algorithm by a magnitude of one or two degrees over traditional MapReduce programs. It is also used in iterative machine learning algorithms. Almost all machine learning algorithms work iteratively, and as we have seen earlier, iterative algorithms involve I/O bottlenecks in MapReduce implementations; MapReduce uses coarse-grained tasks that are too heavy for iterative algorithms. Spark caches the intermediate data set after each iteration and runs multiple iterations on the cached data set, which eventually reduces the I/O overhead and executes the algorithm faster in a fault-tolerant manner. So at the end, which one is the best? The answer to this is that Hadoop and Apache Spark are not competing with one another; in fact, they complement each other quite well. Hadoop brings huge data sets under control on commodity systems, and Spark provides real-time in-memory processing for those data sets.
When we combine Apache Spark's abilities, that is, its high processing speed, advanced analytics and multiple integration support, with Hadoop's low-cost operation on commodity hardware, it gives the best results. Hadoop complements Apache Spark's capabilities; Spark will not completely replace Hadoop, but the good news is that the demand for Spark is currently at an all-time high. If you want to learn more about the Hadoop ecosystem tools and Apache Spark, don't forget to take a look at the Edureka YouTube channel and check out the big data and Hadoop playlist. Welcome everyone to today's session on Kafka Spark Streaming. So without any further delay, let's look at the agenda first. We will start by understanding what Apache Kafka is, then we will discuss the different components of Apache Kafka and its architecture.
Further, we will look at different Kafka commands. After that, we'll take a brief overview of Apache Spark and will understand the different Spark components. Finally, we'll look at the demo where we will use Spark Streaming with Apache Kafka. Let's move to our first slide. So in a real-time scenario, we have different systems or services which will be communicating with each other, and the data pipelines are the ones which establish the connection between two servers or two systems. Now, let's take the example of an e-commerce website, where it can have multiple servers at the front end, like a web or application server for hosting the application. It can have a chat server to provide chat facilities to the customers, then it can have a separate server for payments, etc.
Similarly, an organization can also have multiple servers at the back end which will be receiving messages from different front-end servers based on the requirements. Now, they can have a database server which will be storing the records, then they can have security systems for user authentication and authorization, then they can have a real-time monitoring server, which is basically used for recommendations. So all these data pipelines become complex with the increase in the number of systems, and adding a new system or server requires more data pipelines, which will again make the data flow more complicated and complex. Now, managing these data pipelines also becomes very difficult, as each data pipeline has its own set of requirements. For example, data pipelines which handle transactions should be more fault tolerant and robust; on the other hand, a clickstream data pipeline can be more fragile. So adding some pipelines or removing some pipelines becomes more difficult in such a complex system. So now I hope that you have understood the problem from which messaging systems originated. Let's move to the next slide and we'll understand how Kafka solves this problem. Now, a messaging system reduces the complexity of data pipelines and makes the communication between systems simpler and more manageable.
With a messaging system, you can easily establish remote communication and send your data across the network. Now, different systems may use different platforms and languages, and a messaging system provides you a common paradigm independent of any platform or language. So basically it decouples the platforms on which your front-end server as well as your back-end server are running. You can also establish asynchronous communication and send messages, so that the sender does not have to wait for the receiver to process the messages. Now, one of the benefits of a messaging system is that you get reliable communication: even when the receiver or the network is not working properly, your messages won't get lost. Now, talking about Kafka: Kafka decouples the data pipelines and solves the complexity problem. The applications which are producing messages to Kafka are called producers, and the applications which are consuming those messages from Kafka are called consumers. Now, as you can see in the image, the front-end server, your application server one, application server two and the chat server are producing messages to Kafka, and these are called producers, while your database server, security systems, real-time monitoring server, other services and the data warehouse
are basically consuming the messages and are called consumers. So your producer sends a message to Kafka, Kafka stores those messages, and consumers who want those messages can subscribe and receive them. Now, you can also have multiple subscribers to a single category of messages, so your database server and your security system can be consuming the same messages which are produced by application server 1. And again, adding a new consumer is very easy: you can go ahead and add a new consumer, say consumer 1, and just subscribe it to the categories of messages that it requires, for example the messages produced by application server 1. So let's quickly move ahead and talk about Apache Kafka. Apache Kafka is a distributed publish/subscribe messaging system. Messaging traditionally has two models: queuing and publish/subscribe. In a queue, a pool of consumers may read from a server and each record goes to only one of them, whereas in publish/subscribe the record is broadcast to all consumers, so multiple consumers can get the record. The Kafka cluster is distributed and has multiple machines running in parallel, and this is the reason why Kafka is fast, scalable and fault tolerant. Now, let me tell you that Kafka was developed at LinkedIn and later became a part of the Apache project. Now, let us look at some of the important terminologies. We'll first start with topic. A topic is a category or feed name to which records are published, and topics in Kafka are always multi-subscriber. That is, a topic can have zero, one or multiple consumers that subscribe to the topic and consume the data written to it. Let me give you an example.
You can have sales records getting published to a sales topic, product records getting published to a product topic, and so on. This actually segregates your messages, and a consumer will subscribe only to the topics that it needs; a consumer can also subscribe to two or more topics. Now, let's talk about partitions. Kafka topics are divided into a number of partitions, and partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers, which means each partition can be placed on a separate machine to allow multiple consumers to read from a topic in parallel. So in the case of the sales topic you can have three partitions, partition 0, partition 1 and partition 2, from which three consumers can read data in parallel. Now moving ahead, let's talk about producers. Producers are the ones who publish data to the topics of their choice. Then you have consumers: consumers can subscribe to one or more topics and consume data from those topics. Now, consumers basically label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. So suppose you have a consumer group, let's say consumer group 1, and you have three consumers residing in it, that is consumer A, consumer B and consumer C. Now, from the sales topic, each record will be read once by consumer group 1, and it can be read either by consumer A or consumer B or consumer C, but it will only be consumed once by that single consumer group, consumer group 1. But again, you can have multiple consumer groups subscribing to a topic, where one record can be consumed by multiple consumers, that is, one consumer from each consumer group. So now let's say you have consumer group 1 and consumer group 2. In consumer group 1 we have two consumers, consumer 1A and consumer 1B, and in consumer group 2 we have two consumers, consumer 2A and consumer 2B. So if consumer group 1 and consumer group 2 are both consuming messages from the topic sales,
then a single record will be consumed by consumer group 1 as well as consumer group 2, and a single consumer from each of the consumer groups will consume the record once. So I guess you are now clear with the concept of consumers and consumer groups. Now, consumer instances can be separate processes or separate machines. Now talking about brokers: brokers are nothing but the individual machines in the Kafka cluster. And ZooKeeper is another Apache open-source project; it stores the metadata information related to the Kafka cluster, like broker information, topic details etc. ZooKeeper is basically the one who is managing the whole Kafka cluster. Now, let's quickly go to the next slide. So suppose you have a topic, let's assume this is the topic sales, and you have four partitions: partition 0, partition 1, partition 2 and partition 3. Now you have five brokers over here. Let's take the case of partition 1: if the replication factor is 3, it will have 3 copies, which will reside on different brokers. So one replica is on broker 2, the next is on broker 3 and the next is on broker 5, and as you can see for replica 5, the 5 comes from broker 5.
So the ID of a replica is the same as the ID of the broker that hosts it. Now moving ahead, one of the replicas of partition 1 will serve as the leader replica. So the leader of partition 1 is replica 5, and any consumer coming and consuming messages from partition 1 will be served by this replica. The other two replicas are basically for fault tolerance, so that if broker 5 goes down or its disk becomes corrupt, one of replica 3 or replica 2 will take over as the leader, and this is decided on the basis of the most in-sync replica. So the replica which is most in sync with the leader will become the next leader. Similarly, partition 0 may reside on broker 1, broker 2 and broker 3, your partition 2 may reside on broker 4, broker 5 and, say, broker 1, and then your third partition might reside on another three brokers.
So suppose that this is the leader for partition 2, this is the leader for partition 0, this is the leader for partition 3 and this is the leader for partition 1. You can see that four consumers can consume data in parallel from these brokers: one can consume data from partition 2, this consumer can consume data from partition 0, and similarly for partition 3 and partition 1. So maintaining the replicas basically helps in fault tolerance, and keeping different partition leaders on different brokers helps in parallel execution, or you can say consuming those messages in parallel. So I hope that you guys are clear about topics, partitions and replicas. Now, let's move to our next slide. So this is how the whole Kafka cluster looks: you have multiple producers, which are again producing messages to Kafka.
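The leader failover just described can be sketched in plain Python. This is a toy model, not Kafka's actual controller logic, and the broker IDs and "lag" values here are made up purely for illustration:

```python
# Toy model of one partition's replicas: each replica lives on a broker and has
# a "lag" saying how far behind the leader it is (0 = fully in sync).
replicas = {5: 0, 3: 1, 2: 4}   # broker_id -> lag; broker 5 currently leads

def elect_leader(replicas, failed_broker):
    """Drop the failed broker and promote the most in-sync surviving replica."""
    survivors = {b: lag for b, lag in replicas.items() if b != failed_broker}
    if not survivors:
        raise RuntimeError("no replica left for this partition")
    # The replica with the smallest lag is the most in sync.
    return min(survivors, key=survivors.get)

# Broker 5 (the current leader) goes down; replica 3 is closest in sync.
print(elect_leader(replicas, failed_broker=5))  # -> 3
```

The same logic explains why the replicas that are not leading still matter: they are the candidates the cluster falls back on.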
Then this whole thing is the Kafka cluster, where you have two nodes: node 1 has two brokers, broker 1 and broker 2, and node 2 has two brokers, broker 3 and broker 4. Again, consumers will be consuming data from these brokers, and ZooKeeper is the one who is managing this whole Kafka cluster. Now, let's look at some basic Kafka commands and understand how Kafka works: how to go ahead and start ZooKeeper, how to start the Kafka server, and how to produce some messages to Kafka and then consume some messages from Kafka. So let me quickly switch to my VM and open the terminal. Let me quickly go ahead and execute sudo jps so that I can check all the daemons that are running in my system.
So you can see I have NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer. Now, as all the HDFS daemons are running, let us quickly go ahead and start the Kafka services. First I will go to the Kafka home; let me show you the directory, so my Kafka is in /usr/lib. Now, let me quickly go ahead and start the ZooKeeper service. But before that, let me show you the zookeeper.properties file. The clientPort is 2181, so my ZooKeeper will be running on port 2181, and the data directory in which my ZooKeeper will store all the metadata is /tmp/zookeeper. So let us quickly go ahead and start ZooKeeper, and the command is bin/zookeeper-server-start.sh; this is the script file, and then I'll pass the properties file which is inside the config directory, and hit enter. Meanwhile, let me open another tab; here I will be starting my first Kafka broker.
But before that, let me show you the properties file. So we'll go into the config directory again, and I have server.properties. This is the properties file of my first Kafka broker. First we have the server basics: here the broker ID of my first broker is 0. Then the port is 9092, on which my first broker will be running; this section contains all the socket server settings. Then moving ahead, we have the log basics. In the log basics, this is the log directory, which is /tmp/kafka-logs, so this is where my Kafka will store all the messages or records produced by the producers. All the records which belong to broker 0 will be stored at this location. Now, the next section is internal topic settings, in which the offsets topic replication factor is 1, and the transaction state log replication factor is 1. Next we have the log retention policy. The log retention hours is 168, so your records will be stored for 168 hours by default and then deleted. Then you have the ZooKeeper properties, where we have specified zookeeper.connect; as we have seen in the zookeeper.properties file, our ZooKeeper runs on port 2181, so we are giving the address of ZooKeeper, that is localhost:2181. And at last we have the group coordinator settings. So let us quickly go ahead and start the first broker. The script file is kafka-server-start.sh, and then we have to give the properties file, which is server.properties for the first broker. I'll hit enter, and meanwhile let me open another tab. Now I'll show you the next properties file, which is server-1.properties. So the things which you have to change for creating a new broker: first you have to change the broker ID, so where my earlier broker ID was 0, the new broker ID is 1. Again, you can replicate this file, and for a further server you would change the broker ID to 2. Then you have to change the port, because my first broker, that is broker 0, is already running on 9092, so my new broker should bind to a different port, and here I have specified 9093. The next thing you have to change is the log directory. Here I have added a "-1" to the default log directory, so all the records which are stored on my broker 1 will go to this particular directory, that is /tmp/kafka-logs-1. The rest of the settings are similar, so let me quickly go ahead and start the second broker as well. And let me open one more terminal and start the third broker too. So ZooKeeper is started, and the first, second and third brokers are all started. Now let me quickly minimize this and open a new terminal. First, let us look at some commands related to Kafka topics. I'll quickly go ahead and create a topic. Again, let me first go to my Kafka home directory. The script file is kafka-topics.sh, then the first parameter is --create, then we have to give the address of ZooKeeper, because ZooKeeper is the one who actually holds all the details related to your topics. The address of my ZooKeeper is localhost:2181, and then we'll give the topic name.
So let me name the topic kafka-spark. Next we have to specify the replication factor of the topic; it will replicate all the partitions inside the topic that many times. For --replication-factor, as we have three brokers, let me keep it as 3. And then we have --partitions; I will keep it as 3 as well, because we have three brokers running and our consumers can then consume messages in parallel from three brokers. Let me press enter. So now you can see the topic is created. Now, let us quickly go ahead and list all the topics. The command for listing all the topics is ./bin again, we'll use the kafka-topics.sh script file, then --list, and again we'll provide the address of ZooKeeper.
So again, to list the topics we first call the kafka-topics.sh script file, then we give the --list parameter, and next we give the ZooKeeper address, which is localhost:2181. I'll hit enter, and you can see the kafka-spark topic has been created. Now, let me show you one more thing. Again we'll go to bin/kafka-topics.sh and we'll describe this topic: I will pass the address of ZooKeeper, which is localhost:2181, and then I'll pass the topic name, which is kafka-spark. So now you can see here: the topic is kafka-spark, the partition count is 3, the replication factor is 3, and the config is as follows. Here you can see all three partitions of the topic, that is partition 0, partition 1 and partition 2. The leader for partition 0 is broker 2, the leader for partition 1 is broker 0, and the leader for partition 2 is broker 1. So you can see we have different partition leaders residing on different brokers; this is basically for load balancing,
so that different partitions can be served from different brokers and consumed in parallel. Again, you can see the replicas of this partition reside on all three brokers, same with partition 1 and partition 2, and it's showing you the in-sync replicas: for partition 0 the in-sync replicas are 2, then 0, then 1, and similarly for partitions 1 and 2. So now let us quickly go ahead; I'll resize this to half and open one more terminal. The reason I'm doing this is so that we can produce messages from one console and receive them in another console. For that I'll start the Kafka console producer first. The command is ./bin/kafka-console-producer.sh, and in the case of the producer you have to give the parameter --broker-list, which is localhost:9092. You can provide any one of the running brokers and it will discover the rest of the brokers from there.
So you just have to provide the address of one broker. You can also provide a set of brokers, so you can give it as localhost:9092,localhost:9093 and so on. Here I am passing the address of the first broker. Next I have to mention the topic, so the topic is kafka-spark, and I'll hit enter. So my console producer is started; let me produce a message saying hi. Now, in the second terminal I will go ahead and start the console consumer. Again, the command is kafka-console-consumer.sh, and in the case of the consumer you have to give the parameter --bootstrap-server. This is the thing to notice, guys: for the producer you give --broker-list, while for the consumer you give --bootstrap-server, and it is again the same address, localhost:9092, which is the address of my broker 0. Then I will give the topic, which is kafka-spark. Now, adding the parameter --from-beginning will give me the messages stored in that topic from the beginning; otherwise, if I don't give the --from-beginning parameter, I'll only get the recent messages that were produced after starting this console consumer.
So let me hit enter, and you can see I should get the message saying hi first. Well, I'm sorry guys, the topic name I have given is not correct, sorry for my typo; let me quickly correct it and hit enter. So as you can see, I am receiving the messages: I received hi, and let me produce some more messages. Now you can see all the messages that I am producing from the console producer are getting consumed by the console consumer. This console producer and console consumer are basically used by developers to test the Kafka cluster. So if there is a producer running and producing messages to Kafka, you can go ahead and start a console consumer and check whether the producer is producing messages or not, or you can check the format in which your messages are getting produced to the topic. That kind of testing is done using the console consumer. Similarly, using the console producer, suppose you are writing a consumer: you can go ahead and produce a message to a Kafka topic and then check whether your consumer is consuming that message or not. So these tools are basically used for testing. Now, let us quickly close this and get back to our slides. I have briefly covered Kafka and its concepts, so here I have basically given you a small idea of what Kafka is and how Kafka works. We have now understood why we need messaging systems, what Kafka is, the different terminologies in Kafka, how the Kafka architecture works, and we have seen some of the basic Kafka commands.
So let us now understand what Apache Spark is. Basically, Apache Spark is an open-source cluster computing framework for near real-time processing. Spark provides an interface for programming the entire cluster with implicit data parallelism and fault tolerance. We'll talk about how Spark provides fault tolerance, but talking about implicit data parallelism: it means you do not need any special directives, operators or functions to enable parallel execution, because Spark provides the data parallelism by default. Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, machine learning and streaming. Basically, the main feature of Spark is its in-memory cluster computing, which increases the processing speed of the application. So what Spark does: it does not write the data to disk after each step, but transforms the data and keeps it in memory, so that multiple operations can be applied over the data quickly, and only the final result is stored on disk. Apart from this, Spark can also do batch processing up to a hundred times faster than MapReduce.
And this is the reason why Apache Spark is the go-to tool for big data processing in the industry. Now, let's quickly move ahead and understand how Spark does this. The answer is RDDs, that is, resilient distributed datasets. An RDD is a read-only partitioned collection of records, and it is the fundamental data structure of Spark. Basically, an RDD is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster, and an RDD can contain any type of Python, Java or Scala objects. Now talking about fault tolerance: an RDD is a fault-tolerant collection of elements that can be operated on in parallel. How does an RDD achieve that? If an RDD partition is lost, it will automatically be recomputed by using the original lineage of transformations, and this is how Spark provides fault tolerance. So I hope that you guys are clear on how Spark provides fault tolerance. Now let's talk about how we can create RDDs. There are two ways to create RDDs: the first is parallelizing an existing collection in your driver program, or you can refer to a dataset in an external storage system, such as a shared file system.
It can be HDFS, HBase or any other data source offering a Hadoop InputFormat. Now, Spark makes use of the concept of RDDs to achieve fast and efficient operations. Let's quickly move ahead and look at how RDDs work. First we create an RDD, either by parallelizing a collection or by referring to an external storage system. Then, once you have an RDD, you can go ahead and apply multiple transformations over it, like filter, map, union etc., and each transformation again gives you a new, transformed RDD. At last, you apply some action and get the result; this action can be count, first, take, collect, all those kinds of functions. So that was a brief idea of what an RDD is and how RDDs work. Now let's quickly move ahead and look at the different workloads that can be handled by Apache Spark: we have interactive streaming analytics, then we have machine learning, we have data integration, and we have Spark streaming and processing.
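The create, transform, action flow just described can be sketched in plain Python: transformations are only recorded, and nothing runs until an action is called. This is a conceptual illustration of the RDD pattern, not the actual PySpark API:

```python
# A toy "RDD": map/filter are recorded lazily as a lineage, and nothing
# executes until an action (collect/count) is called, as in Spark.
class ToyRDD:
    def __init__(self, data, ops=()):
        self.data = list(data)   # stand-in for a distributed dataset
        self.ops = ops           # the lineage of recorded transformations

    def map(self, f):
        return ToyRDD(self.data, self.ops + (("map", f),))

    def filter(self, pred):
        return ToyRDD(self.data, self.ops + (("filter", pred),))

    def collect(self):           # action: actually run the lineage
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):             # action built on top of collect
        return len(self.collect())

rdd = ToyRDD(range(1, 6))
transformed = rdd.map(lambda x: x * 10).filter(lambda x: x > 20)
print(transformed.collect())   # -> [30, 40, 50]
print(transformed.count())     # -> 3
```

Notice that every transformation returns a new object and the base data is never mutated; that immutability plus the recorded lineage is also what makes recomputation after a failure possible.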
So let us talk about them one by one, starting with Spark streaming and processing. Basically, you know data arrives at a steady rate, or you can say as continuous streams. Now, one option is to store the dataset on disk and then apply some processing and analytics over it and get some results, but that is not the right approach for every case. Let's take the example of financial transactions, where you have to identify and refuse potentially fraudulent transactions. If you first store the data stream and only then apply some processing, it would be too late and someone would have got away with the money. So in that scenario, what you need to do is quickly take that input data stream,
apply some transformations over it, and then take actions accordingly: you can send a notification, or you can actually reject that fraudulent transaction, something like that. And then, if you want, you can store those results or the dataset in some database or file system. So we have scenarios where we have to process the stream of data and then store it, perform some analysis on it, or take some necessary action, and this is where Spark Streaming comes into the picture: Spark is a great fit for processing continuous input data streams. Now moving on to the next workload, machine learning. As you know, first we create a machine learning model, then we continuously feed the incoming data streams to the model, and we get continuous output based on the input values. Now, we reuse intermediate results across multiple computations in multi-stage applications, which normally incurs substantial overhead due to data replication, disk I/O and serialization, and that makes a system slow. What Spark does is store intermediate results in distributed memory as RDDs instead of on stable storage, which makes the system faster.
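That reuse of intermediate results can be sketched in plain Python with a counter that shows how caching avoids recomputation. This is a toy illustration of the idea behind Spark's in-memory persistence, not its implementation:

```python
# Count how often the "expensive" stage runs, with and without a cache.
compute_calls = 0

def expensive_stage(data):
    """Stand-in for a costly multi-stage transformation (replication, disk I/O)."""
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

data = [1, 2, 3, 4]

# Without caching: every iteration of the algorithm recomputes the stage.
for _ in range(3):
    intermediate = expensive_stage(data)
print(compute_calls)   # -> 3

# With caching (the persist/cache idea): compute once, reuse from memory.
compute_calls = 0
cached = expensive_stage(data)
for _ in range(3):
    intermediate = cached
print(compute_calls)   # -> 1
```

For an iterative algorithm that loops hundreds of times over the same intermediate dataset, this difference dominates the runtime.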
So as we saw, in Spark all the transformations are applied on RDDs and the transformed RDDs can be kept in memory itself, so we can quickly apply more iterative algorithms over them, and we don't spend time on things like data replication or disk I/O; all those overheads are reduced. Now you might be wondering that memory is always limited, so what if the memory gets full? Well, if the distributed memory is not sufficient to store intermediate results, then Spark will store those results on the disk. So I hope that you guys are clear on how Spark performs these iterative machine learning algorithms and why Spark is fast. Let's look at the next workload, which is interactive streaming analytics. As we already discussed for streaming data: a user runs ad-hoc queries on the same subset of data, and each query would do disk I/O on the stable storage, which can dominate the application's execution time.
So let me take the example of a data scientist. Basically, you have continuous streams of data coming in. What would your data scientist do? They will ask some questions, execute some queries over the data, then view the result, and then they might alter the initial question slightly after seeing the output, or they might drill deeper into the results and execute some more queries over the gathered result. So there are multiple scenarios in which your data scientist would be running interactive queries on the streaming data. Now, how does Spark help with this interactive streaming analytics? Each transformed RDD may be recomputed each time you run an action on it, right? But when you persist an RDD in memory, Spark will keep all the elements around on the cluster for faster access, and whenever you execute the next query over the data, it will run quickly and give you a near-instant result. So I hope that you guys are clear on how Spark helps in interactive streaming analytics. Now, let's talk about data integration.
So basically, as you know, in large organizations data is produced by different systems across the business, and you need a framework which can integrate different data sources. Spark is the one which actually integrates different data sources: you can take the data from Kafka, Cassandra, Flume, HBase or Amazon S3, perform some real-time analytics, or even some near real-time analytics, over it, apply some machine learning algorithms, and then you can go ahead and store the processed result in Apache HBase, MySQL or HDFS; it could even be Kafka. So Spark gives you multiple options for where you pick the data up from and, again, where you write the data to. Now let's quickly move ahead and talk about the different Spark components. So you can see here I have the Spark Core engine.
So basically this is the core engine, and on top of this core engine you have Spark SQL, Spark Streaming, then MLlib, then you have GraphX and then SparkR. Let's talk about them one by one, and we'll start with the Spark Core engine. The Spark Core engine is the base engine for large-scale parallel and distributed data processing. Additional libraries built on top of the core allow diverse workloads for streaming, SQL and machine learning, and you can also go ahead and execute R on Spark, or Python on Spark, those kinds of workloads. Basically, your Spark Core engine is the one which manages your memory, your fault recovery, your scheduling, distributing and monitoring of jobs on a cluster, and interacting with the storage systems. So, in short, we can say that the Spark Core engine is the heart of Spark, and all of these libraries sit on top of it. First, let's talk about Spark Streaming. Spark Streaming is the component of Spark which is used to process real-time streaming data, as we just discussed, and it is a useful addition to the core Spark API.
Now, it enables high-throughput and fault-tolerant stream processing of live data streams, so you can go ahead and perform all your streaming data analytics using Spark Streaming. Then you have Spark SQL over here. Basically, Spark SQL is a module in Spark which integrates relational processing with Spark's functional programming API, and it supports querying data either via SQL or via HQL, the Hive Query Language. For those of you who are familiar with RDBMSs, Spark SQL is an easy transition from your earlier tools, where you can go ahead and extend the boundaries of traditional relational data processing. Now talking about GraphX: GraphX is the Spark API for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD abstraction by introducing the resilient distributed property graph, which is nothing but a directed multigraph with properties attached to each vertex and edge. Next we have SparkR: basically, it provides packages for the R language, so you can go ahead and leverage Spark's power from the R shell. Next you have Spark MLlib.
So MLlib basically stands for machine learning library, and Spark MLlib is used to perform machine learning in Apache Spark. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines. This basically includes summary statistics, correlations, classification and regression, collaborative filtering techniques, then cluster analysis methods, then you have dimensionality reduction techniques, feature extraction and transformation functions, and then you have optimization algorithms. It is basically a machine learning package on top of Spark. Then you also have something called PySpark, which is the Python package for Spark, where you can go ahead and leverage Python over Spark. So I hope that you guys are clear with the different Spark components.
So before moving to the Kafka-Spark streaming demo: I have just given you a brief intro to Apache Spark. If you want a detailed tutorial on Apache Spark or on its different components, like Apache Spark SQL, Spark DataFrames, Spark Streaming, Spark GraphX or Spark MLlib, you can go to Edureka's YouTube channel. So now we are here, guys. I know that you have been waiting for this demo for a while, so let's go ahead and look at the Kafka-Spark streaming demo. Let me quickly go ahead and open my virtual machine, and I'll open a terminal. Let me first check all the daemons that are running in my system. My ZooKeeper is running, NameNode is running, DataNode is running, then my ResourceManager is running, all three Kafka brokers are running, then NodeManager is running, and JobHistoryServer is running. So now I have to start my Spark daemons; let me first go to the Spark home and start the Spark daemons.
The command is ./sbin/start-all.sh. So let me quickly go ahead and execute sudo jps to check my Spark daemons, and you can see the Master and Worker daemons are running. So let me close this terminal and go to the project directory. Basically, I have two projects: this one is the Kafka transaction producer, and the next one is spark-streaming-kafka-master. First we will be producing messages from the Kafka transaction producer, and then we'll be streaming those records, which are produced by this producer, using spark-streaming-kafka-master. So first, let me take you through this Kafka transaction producer. This is our pom.xml file; let me open it with gedit. So basically this is a Maven project, and I have used the Spring Boot server. I have given the Java version as 8, you can see the kafka-clients dependency over here along with the version of the Kafka client, then you can see I'm pulling in jackson-databind,
then Gson, and then I am packaging it as a WAR file, that is, a web archive file. And here I am again specifying the Spring Boot Maven plugin, which is to be downloaded. So let me quickly go ahead and close this, and we'll go to the source directory and then inside main. So basically this is the file, the SalesJan2009 file; let me show you the file first. These are the records which I'll be producing to Kafka, and the fields are transaction date, then product, price, payment type, name, city, state, country, account created, then last login, latitude and longitude. So let me close this file. Then application.yml is the main property file. In this application.yml I am specifying the bootstrap server, which is localhost:9092, and then I am specifying the brokers, which again reside on localhost:9092; so here I have specified the broker list. Now, next I have the product topic.
So the topic for the products is transaction, and the partition count is 1. Basically, your acks config controls the criteria under which requests are considered complete, and the "all" setting we have specified will result in blocking on the full commit of the record; it is the slowest but the most durable setting. Now talking about retries: it will retry thrice. Then we have the min pool size and the maximum pool size, which are basically for configuring the Java threads, and at last we have the file path. This is the path of the file which I have just shown you, so messages will be read from this file. Let me quickly close this file and we'll look at application.properties. Here we have specified the properties for the Spring Boot server: we have the server context path, that is /edureka, then we have the Spring application name, that is kafka-producer, we have the server port, and the Spring events timeout is 20.
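Putting the values just described together, the application.yml looks roughly like this. The exact key names are assumptions for illustration, since they are defined by this particular project rather than by Kafka itself, and the file path is a placeholder:

```yaml
# Approximate shape of the producer's application.yml described above.
kafka:
  bootstrap-servers: localhost:9092   # broker list
  product-topic: transaction          # topic the producer writes to
  partition-count: 1
  acks: all         # block until the record is fully committed (slowest, most durable)
  retries: 3        # retry a failed send three times
  min-pool-size: 1  # thread-pool bounds for the sending threads (illustrative values)
  max-pool-size: 4
  file-path: /path/to/SalesJan2009.csv   # source file the records are read from
```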
So let me close this as well. Let's go back. Let's go inside java, then com, edureka, kafka. So we'll explore the important files one by one. So let me first take you through this dto directory. And over here, we have Transaction.java. So basically, here we are storing the model. So you can see these are the fields from the file which I have shown you. So we have transaction date, we have product, price, payment type, name, city, state, country and so on. So we have created a variable for each field. So what we are doing is basically creating getter and setter functions for all these variables. So we have getTransactionId, which will basically return the transaction ID, then we have setTransactionId, which will basically set the transaction ID. Similarly, we have getTransactionDate for getting the transaction date.
Then we have setTransactionDate, and it will set the transaction date using the this keyword. Then we have getProduct and setProduct, getPrice and setPrice, and all the getter and setter methods for each of the variables. This is the constructor. So here we are taking all the parameters, like transaction date, product, price, and then we are setting the value of each of the variables using the this keyword. So we are setting the value for transaction date, product, price, payment and all of the fields that are present over there. Next, we are also creating a default constructor, and then over here we are overriding the toString method. And what we are doing is basically printing the transaction details: we are returning transaction date and then the value of transaction date, product then the value of product, price then the value of price, and so on for all the fields. So basically, this is the model of the transaction, so we can go ahead and create an object of this transaction and then we can easily go ahead and send the transaction object as the value. So this is the main reason for creating this transaction model.
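The model described above is a classic Java bean: one variable per field, explicit getters and setters, a full constructor, and a toString override. As a minimal sketch, the same idea in Python, where a dataclass generates the equivalent constructor and repr automatically (the field names follow the file shown; this is an illustration, not the project's code):

```python
from dataclasses import dataclass

# A compact analogue of the Transaction model described above.
# In the Java project each field has an explicit getter/setter plus a
# toString override; a dataclass gives us all of that for free.
@dataclass
class Transaction:
    transaction_date: str
    product: str
    price: float
    payment_type: str
    name: str
    city: str
    state: str
    country: str

t = Transaction("01/01/09", "Product1", 1200.0, "Visa",
                "Alice", "Dublin", "Dublin", "Ireland")
print(t.product)   # plain attribute access replaces getProduct()
```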
Now let me quickly go ahead and close this file. Let's go back and first take a look at this config. So this is KafkaProperties.java. So what we did, again, as I have shown you in the application.yml file: we have taken all the parameters that we have specified over there, that is, your bootstrap, product topic, partition count, then brokers, file name and thread count. So for all these properties, then the file path, we have created a variable, and then what we are doing, again, is the same thing as we did with our transaction model: we are creating a getter and setter method for each of these variables. So you can see we have getFilePath, and we are returning the file path; then we have setFilePath, where we are setting the file path using the this keyword. Similarly, we have getProductTopic and setProductTopic, then we have getters and setters for the thread count, and getters and setters for bootstrap and all those properties. Now we can again go ahead and use this KafkaProperties anywhere, and then we can easily extract those values using the getter methods.
So let me quickly close this file and I'll take you to the configurations. So in this configuration, what we are doing is creating the object of KafkaProperties, as you can see. Then we are again creating a Properties object and setting the properties, so you can see that we are setting the bootstrap server config and then retrieving the value using the KafkaProperties object, and this is the getBootstrapServer function. Then you can see we are setting the acknowledgement config, and we are getting the acknowledgement from this getAcknowledgement function, and then we are using this getRetries method. So from this KafkaProperties object we are calling those getter methods, retrieving those values, and setting them in this Properties object.
Next we have the partitioner class. So we are basically using the DefaultPartitioner, which is present in the org.apache.kafka.clients.producer.internals package. Then we are creating a producer over here and we are passing this props object, which will set the properties. So over here, we are passing the key serializer, which is the StringSerializer, and then this is the value serializer, in which we are creating a new custom JSON serializer, and we are passing the transaction over here, and then it will return the producer. And then we are implementing threads: we are again getting the minimum pool size and the maximum pool size from KafkaProperties. So over here, we are implementing Java threads now.
Let me quickly close this Kafka producer configuration, where we are configuring our Kafka producer. Let's go back. Let's quickly go to this API, which has the event producer API Java file. So here we are basically creating an event producer API, which has this dispatch function. So we'll use this dispatch function to send the records. So let me quickly close this file. Let's go back. We have already seen the config and configurations, in which we are basically retrieving those values from the application.yml file and then setting the producer configurations. Then we have constants. So in KafkaConstants.java we have created this Kafka constants interface, where we have specified the batch size, account limit, checksum limit, then read batch size, minimum balance, maximum balance, minimum account, maximum account. Then we are also implementing a date-time formatter. So we are specifying all the constants over here. Let me close this file. Let's go back. Next, we will not look at these two files in detail, but let me tell you what these two files do: they are basically to record the metrics of your Kafka, like the time in which a thousand records have been produced to Kafka.
You can say the time in which records are getting published to Kafka will be monitored, and then you can record those stats. So basically it helps in optimizing the performance of your Kafka producer, right? You can actually learn how to adjust those configuration factors and then see the difference, or you can monitor the stats and understand how you can make your producer more efficient. So these files are basically for those factors, but let's not worry about this right now. Let's go back. Next, let me quickly take you through this file utility. So you have the file utility Java file. So basically, what we are doing over here is reading each record from the file using a file reader. So over here, you can see we have this list, then we have a BufferedReader, then we have a FileReader. So first we are reading the file, and then we are trying to split each of the fields present in the record. And then we are setting the value of those fields over here. Then we are specifying some of the exceptions that may occur, like NumberFormatException or a parse exception; all those kinds of exceptions we have specified over here, and then we are closing this. So in this file,
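The file utility just described boils down to: read each line, split it into fields, build a record, and skip lines that fail to parse. A plain-Python sketch of that logic, with the field handling simplified to three columns (the real project does this in Java with a BufferedReader):

```python
import io

def read_records(f):
    """Parse comma-separated sales lines into dicts, skipping bad rows."""
    records = []
    for line in f:
        parts = line.strip().split(",")
        if len(parts) < 3:
            continue  # malformed line, analogous to catching a parse exception
        try:
            records.append({
                "transaction_date": parts[0],
                "product": parts[1],
                # may raise ValueError, the analogue of NumberFormatException
                "price": float(parts[2]),
            })
        except ValueError:
            continue
    return records

sample = io.StringIO("01/01/09,Product1,1200\nbad line\n01/02/09,Product2,3600\n")
print(read_records(sample))
```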
we are basically reading the records. Now, let me close this. Let's go back. Now let's take a quick look at the serializer. So this is the custom JSON serializer. So in the serializer, we have created a custom JSON serializer. Now, this is basically to write the values as bytes. So the data which you will be passing will be written in bytes, because as we know, data is sent to Kafka in the form of bytes, and this is the reason why we have created this custom JSON serializer. So now let me quickly close this. Let's go back. This file is basically for my Spring Boot web application, so let's not get into this. Let's look at the events Java file. So basically, over here we have the event producer API. So now we are trying to dispatch those events, and to show you how the dispatch function works, let me go back. Let me open services and EventProducerImpl, the implementation. So let me show you how this dispatch works.
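The custom JSON serializer's job, as described, is simply to turn the transaction object into bytes, since Kafka transports byte arrays. A minimal Python equivalent of that serialize/deserialize pair, purely for illustration:

```python
import json

def serialize(obj: dict) -> bytes:
    # Kafka values travel as bytes, so encode the JSON text to UTF-8.
    return json.dumps(obj).encode("utf-8")

def deserialize(data: bytes) -> dict:
    # The decoder on the consumer side reverses the process.
    return json.loads(data.decode("utf-8"))

record = {"transaction_date": "01/01/09", "product": "Product1", "price": 1200}
payload = serialize(record)
assert deserialize(payload) == record  # round-trip is lossless
```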
So basically, over here, what we are doing first is initializing. So using the file utility, we are basically reading the file, and to read the file we are getting the path using this KafkaProperties object and calling the getter method for the file path. Then what we are doing is basically taking the product list and then trying to dispatch it. So in dispatch we are basically using the Kafka producer, and then we are creating the object of the producer record. Then we are getting the topic using getTopic from this KafkaProperties object, we are getting the transaction ID from the transaction, and then we are using the event producer's send to send the data. And finally, we are trying to monitor this, but let's not worry about the monitoring and stats part, so we can ignore this part. Next,
let's quickly go back and look at the last file, which is the producer. So let me show you this event producer. So what we are doing here is actually creating a logger. So in this onCompletion method, we are basically passing the record metadata, and if your exception is not null, then it will basically throw an error saying so, along with the record metadata; else it will give you the sent message with topic, partition and offset, and then the record metadata, and it will give you all the details regarding topic, partitions and offsets. So I hope that you guys have understood how this Kafka producer is working. Now is the time we need to go ahead and quickly execute this. So let me open a terminal over here. Now, first, to build this project, we need to execute mvn clean install. This will install all the dependencies.
So as you can see, our build is successful. So let me minimize this. And this target directory is created after you build a Maven project. So let me quickly go inside this target directory, and this is the root.war file, that is, the root web archive file, which we need to execute. So let's quickly go ahead and execute this file. But before this, we need to verify whether the data is getting produced to our Kafka topic. So for testing, as I already told you, we need to go ahead and open a console consumer so that we can check whether data is getting published or not. So let me quickly minimize this. So let's quickly go to the Kafka directory, and the command is ./bin/kafka-console-consumer.sh, and then --bootstrap-server localhost:9092. Okay, let me check the topic. What's the topic? Let's go to our application.yml file.
So the topic name is transaction. Let me quickly minimize this, specify the topic name, and I'll hit Enter. So now let me place this console aside, and now let's quickly go ahead and execute our project. So for that, the command is java -jar, and then we'll provide the path of the file that is inside target, and the file is root.war, and here we go. So now you can see in the console consumer, the records are getting published, right? So there are multiple records which have been published to our transaction topic, and you can verify this using the console consumer. So this is where developers use the console consumer. So now we have successfully verified our producer. So let me quickly go ahead and stop the producer. Next, let me stop the consumer as well.
Let's quickly minimize this, and now let's go to the second project, that is, spark-streaming-kafka-master. Again, we have specified all the dependencies that are required. Let me quickly show you those dependencies. Now again, you can see over here we have specified the Java version, then we have specified the artifacts, or you can say the dependencies. So we have the Scala compiler, then we have spark-streaming-kafka, then we have the Kafka clients, then JSON data binding, then we have the Maven compiler plugin. So all those dependencies which are required, we have specified over here. So let me quickly go ahead and close it. Let's quickly move to the source directory, then main, then let's look at the resources again. So this is the application.yml file. So we have port 8080, then we have the bootstrap server over here, then we have the broker over here, then we have the topic as transaction.
The group is transaction, the partition count is 1, and then the file name; we won't be using this file name. Then let me quickly go ahead and close this. Let's go back. Let's go back to the Java directory, com, then the Spark demo package. Then this is the model. So it's the same: these are all the fields that are there in the transaction. You have transaction date, product, price, payment type, then name, city, state, country, account created and so on. And again, we have specified all the getter and setter methods over here, and similarly, again, we have created this TransactionDTO constructor, where we have taken all the parameters and we are setting the values using the this keyword. Next, we are again overriding this toString function, and over here we are again returning the details, like transaction date and then the value of transaction date, product and then the value of product, and similarly all the fields. So let me close this model. Let's go back.
Let's look at the Kafka folder; then we have the serializer. So this is the JSON serializer, which was there in our producer, and this is the transaction decoder. Let's take a look. Now you have the decoder, which is again implementing Decoder, and we're passing this TransactionDTO. Then again, you can see we have this fromBytes method, which we are overriding, and we are reading the values using these bytes and the TransactionDTO class. Again, if it is failing to parse, we are giving 'JSON processing failed for object' and so on, and you can see we have this transaction decoder constructor over here. So let me quickly again close this file. Let's quickly go back. And now let's take a look at the Spark streaming app, where basically the data which the producer project will be producing to Kafka will actually be consumed by the Spark streaming application. So Spark streaming will stream the data in real time and then will display the data. So in this Spark streaming application, we are creating a conf object, and then we are setting the application name as KafkaSandbox. The master is local[*]; then we have the Java
Spark context. So here we are specifying the Spark context, and then next we are specifying the Java streaming context. So this object will basically be used to take the streaming data. So we are passing this JavaSparkContext over here as a parameter, and then we are specifying the duration, that is, 2000. Next, we have the Kafka parameters. So to connect to Kafka, you need to specify these parameters. So in the Kafka parameters, we are specifying the metadata broker list, that is, localhost:9092; then we have auto offset reset, that is, smallest. Then in topics, the name of the topic from which we will be consuming messages is transaction. Next, we're creating a JavaPairInputDStream. So basically, this DStream is a discretized stream, which is the basic abstraction of Spark streaming and is a continuous sequence of RDDs representing a continuous stream of data. Now, the stream can be created from live data from Kafka, HDFS or Flume, or it can be generated by transforming existing DStreams using operations. So over here, we are again creating a Java input DStream, we are passing String and TransactionDTO as parameters, and we are creating the direct Kafka stream object. Then we're using this KafkaUtils and calling the method createDirectStream, where we are passing the parameters as ssc, that is, your Spark streaming context; then you have String.class, which is basically your key class;
then TransactionDTO.class, that is basically your value class; then StringDecoder, that is to decode your key; and then TransactionDecoder, basically to decode your transaction. Then you have the Kafka parameters, which you have created here, where you have specified the broker list and auto offset reset, and then you are specifying the topics, which is your transaction. So next, using this direct Kafka stream, you're continuously iterating over the RDDs, and you are trying to print the new RDD with the RDD partition size, then the RDD count, and the records; so for each record of the RDD, you are printing the record. And then you are starting the Spark streaming context, and then you are waiting for the termination. So this is the Spark streaming application. So let's first quickly go ahead and execute this application. Let me close this file. Let's go to the source. Now, let me quickly go ahead and delete this target directory. So now let me quickly open the terminal and run mvn clean install. So now, as you can see, the target directory is again created, and this spark-streaming-kafka snapshot JAR is created.
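The DStream described above is just a continuous stream chopped into small batches, each batch becoming one RDD that the job then iterates over. A toy plain-Python analogue of that micro-batching idea (no Spark involved, purely to illustrate the concept):

```python
def micro_batches(stream, batch_size):
    """Cut a record stream into fixed-size batches, the way a DStream
    slices a live feed into one RDD per time interval."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

records = [f"txn-{i}" for i in range(7)]
for rdd_like in micro_batches(records, 3):
    # mirrors printing the partition size, count and records per RDD
    print(len(rdd_like), rdd_like)
```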
So we need to execute this JAR. So let me quickly go ahead and minimize it. Let me close this terminal. So now, first, I'll start this Spark streaming job. So the command is java -jar; inside the target directory we have this spark-streaming-kafka JAR, so let's hit Enter. So let me now quickly go ahead and start producing messages. So I will minimize this and I will wait for the messages. So let me quickly close this Spark streaming job, and then I will show you the consumed records. So you can see the record that is consumed by Spark streaming. So here you have the record and the TransactionDTO, and then the transaction date, product, all the details which we have specified; you can see it over here. So this is how Spark streaming works with Kafka. Now, this is just a basic job; again, you can go ahead and take those transactions, perform some real-time analytics over there, and then go ahead and write those results. So over here, we have just given you a basic demo in which we are producing the records to Kafka, and then, using Spark streaming,
we are streaming those records from Kafka. Again, you can go ahead and perform multiple transformations over the data, multiple actions, and produce some real-time results using this data. So this is just a basic demo where we have shown you how to basically produce records to Kafka and then consume those records using Spark streaming. So let's quickly go back to our slide. Now, as this was a basic project, let me explain one of the Kafka-Spark streaming projects which is at Edureka. So basically, there is a company called techreview.com. So this techreview.com basically provides reviews for recent and different technologies, like smartwatches, phones, different operating systems and anything new that is coming onto the market. So what happens is, the company decided to include a new feature which will basically allow users to compare the popularity or trend of multiple technologies based on Twitter feeds, and second, for the USP, they want this comparison to happen in real time. So basically, they have assigned you this task: you have to go ahead, take the real-time Twitter feeds, and then show the real-time comparison of various technologies.
So again, the company is asking you to identify the minutely trend between different technologies by consuming Twitter streams and writing the aggregated minutely counts to Cassandra, from where, again, the dashboarding team will come into the picture, and they will try to dashboard that data, and it can show you a graph where you can see how the trends of two, or you can say various, technologies are going ahead. Now, the solution strategy: so you have to continuously stream the data from Twitter; then you will be storing those tweets inside a Kafka topic; then, second, again, you have to perform Spark streaming, so you will be continuously streaming the data and applying some transformations which will basically give you the minutely trend of the two technologies.
And again, you'll write it back to a Kafka topic, and at last, you'll write a consumer that will be consuming messages from the Kafka topic and will write the data into your Cassandra database. So first, you have to write a program that will be consuming data from Twitter and writing it to a Kafka topic. Then you have to write a Spark streaming job, which will be continuously streaming the data from Kafka and performing analytics to identify the minutely trend, and then it will write the data back to a Kafka topic. And then you have to write the third job, which will basically be a consumer that will consume data from the Kafka topic and write the data to a Cassandra database. Now, PySpark is a powerful framework which has been heavily used in the industry for real-time analytics and machine learning purposes. So before I proceed with the session, let's have a quick look at the topics which we will be covering today. So I'm starting off by explaining what exactly PySpark is and how it works.
As we go ahead, we'll find out the various advantages provided by PySpark. Then I will be showing you how to install PySpark in your systems. Once we are done with the installation, I will talk about the fundamental concepts of PySpark, like the Spark context, DataFrames, MLlib, RDDs and much more. And finally, I'll close off the session with a demo in which I'll show you how to implement PySpark to solve real-life use cases. So without any further ado, let's quickly embark on our journey to PySpark. Now, before I start off with PySpark, let me first brief you about the PySpark ecosystem. As you can see from the diagram, the Spark ecosystem is composed of various components, like Spark SQL, Spark Streaming, MLlib, GraphX and the core API component. The Spark SQL component is used to leverage the power of declarative queries and optimized storage by executing SQL-like queries on Spark data, which is present in RDDs and other external sources. The Spark Streaming component allows developers to perform batch processing and streaming of data with ease in the same application. The machine learning library eases the development and deployment of scalable machine learning pipelines. The GraphX component
lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformations. And finally, the Spark Core component: it is the most vital component of the Spark ecosystem, which is responsible for basic input-output functions, scheduling and monitoring. The entire Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python and Java, and in today's session, I will specifically discuss the Spark API in the Python programming language, which is more popularly known as PySpark. You might be wondering, why PySpark? Well, to get a better insight, let me give you a brief intro to PySpark. Now, as you already know, PySpark is the collaboration of two powerful technologies: Spark, which is an open-source cluster computing framework built around speed, ease of use and streaming analytics.
And the other one is Python, of course: Python, which is a general-purpose, high-level programming language. It provides a wide range of libraries and is majorly used for machine learning and real-time analytics. Now, this gives us PySpark, which is a Python API for Spark that lets you harness the simplicity of Python and the power of Apache Spark in order to tame big data. PySpark also lets you use RDDs and comes with a default integration of the Py4J library. We will learn about RDDs later in this video. Now that you know what PySpark is, let's now see the advantages of using Spark with Python. As we all know, Python itself is very simple and easy, so when Spark is written in Python, it makes PySpark quite easy to learn and use. Moreover,
it's a dynamically typed language, which means RDDs can hold objects of multiple data types. Not only that, it also makes the API simple and comprehensive. And talking about the readability of code, maintenance and familiarity, the Python API for Apache Spark is far better than other programming languages. Python also provides various options for visualization, which is not possible using Scala or Java; moreover, you can conveniently call R directly from Python. On top of this, Python comes with a wide range of libraries like NumPy, pandas, scikit-learn, Seaborn, Matplotlib, and these libraries aid in data analysis and also provide mature, time-tested statistics. With all these features, you can effortlessly program in PySpark, and in case you get stuck somewhere or have a doubt, there is a huge PySpark community out there whom you can reach out to and put your query to, and it is very active.
So I will make good use of this opportunity to show you how to install PySpark in a system. Now, here I'm using a Red Hat Linux based CentOS system; the same steps can be applied for other Linux systems as well. So in order to install PySpark, first make sure that you have Hadoop installed in your system. So if you want to know more about how to install Hadoop, please check out our playlist on YouTube, or you can check out our blog on the Edureka website. First of all, you need to go to the Apache Spark official website, which is spark.apache.org, and in the download section you can download the latest version of the Spark release, which supports the latest version of Hadoop, or Hadoop version 2.7 or above. Now, once you have downloaded it, all you need to do is extract it, or rather say, untar the file contents, and after that you need to put the path where Spark is installed in the .bashrc file. Now, you also need to install pip and Jupyter Notebook using the pip command, and make sure that the versions are recent enough. So, as you can see here, this is what our .bashrc file looks like: here you can see that we have put in the path for Hadoop, Spark, as well as PYSPARK_DRIVER_PYTHON, which is set to Jupyter Notebook.
What it'll do is that the moment you run the PySpark shell, it will automatically open a Jupyter notebook for you. Now, I find Jupyter Notebook very easy to work with rather than the shell, but it's purely a personal choice. Now that we are done with the installation part, let's now dive deeper into PySpark and learn a few of its fundamentals, which you need to know in order to work with PySpark. Now, this timeline shows the various topics which we will be covering under the PySpark fundamentals. So let's start off with the very first topic in our list, that is, the Spark context. The Spark context is the heart of any Spark application. It sets up internal services and establishes a connection to a Spark execution environment. Through a Spark context object,
you can create RDDs, accumulators and broadcast variables, access Spark services, run jobs and much more. The Spark context allows the Spark driver application to access the cluster through a resource manager, which can be YARN or Spark's cluster manager. The driver program then runs the operations inside the executors on the worker nodes, and the Spark context uses Py4J to launch a JVM, which in turn creates a JavaSparkContext. Now, there are various parameters which can be used with a Spark context object, like the master, app name, Spark home, the pyFiles, the environment, the batch size, serializer, configuration, gateway and much more. Among these parameters, the master and app name are the most commonly used. Now, to give you a basic insight into how a Spark program works, I have listed down its basic lifecycle phases: the typical lifecycle of a Spark program includes creating RDDs from external data sources or parallelizing a collection in your driver program; then lazily transforming the base RDDs into new RDDs using transformations; then caching a few of those RDDs for future reuse; and finally, performing actions to execute parallel computation and produce the results.
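The lazy-transformation idea in the lifecycle above can be mimicked in plain Python with generators: nothing is computed when the pipeline is defined, only when an "action" consumes it. This is a sketch of the concept, not Spark itself:

```python
data = range(1, 6)

# "Transformations": building generators records the recipe but runs nothing yet.
squared = (x * x for x in data)             # like rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0)  # like .filter(lambda x: x % 2 == 0)

# "Action": only now does the whole chain actually execute.
result = list(evens)
print(result)  # [4, 16]
```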
The next topic in our list is RDD, and I'm sure people who have already worked with Spark are familiar with this term, but for people who are new to it, let me just explain it. Now, RDD stands for Resilient Distributed Dataset. It is considered to be the building block of any Spark application. The reason behind this is that these elements run and operate on multiple nodes to do parallel processing on a cluster. And once you create an RDD, it becomes immutable, and by immutable, I mean that it is an object whose state cannot be modified after it's created, but we can transform its values by applying certain transformations. They have good fault tolerance and can automatically recover from almost any failure. This adds an added advantage. Now, to achieve a certain task, multiple operations can be applied on these RDDs, which are categorized in two ways: the first is the transformations, and the second one is the actions. The transformations are the operations which are applied on an RDD to create a new RDD. Now, these transformations work on the principle of lazy evaluation, and transformations are lazy in nature.
Meaning, when we call some operation on an RDD, it does not execute immediately; Spark maintains a record of the operations being called, with the help of directed acyclic graphs, which are also known as DAGs. And since the transformations are lazy in nature, when we execute an operation at any time by calling an action on the data, the lazily evaluated data is not loaded until it's necessary, and the moment we call the action, all the computations are performed in parallel to give you the desired output. Now, a few important transformations are map, flatMap, filter, distinct, reduceByKey, mapPartitions, sortBy. Actions are the operations which are applied on an RDD to instruct Apache Spark to apply the computation and pass the result back to the driver; a few of these actions include collect, collectAsMap, reduce, take, first. Now, let me implement a few of these for your better understanding.
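For readers newer to these names, the core transformations correspond almost one-to-one to plain-Python building blocks. A rough correspondence, shown here in ordinary Python rather than the PySpark API:

```python
from itertools import chain
from functools import reduce

words = [["spark", "is", "fast"], ["spark", "scales"]]

# map: one output element per input element
lengths = list(map(len, words))             # [3, 2]

# flatMap: map, then flatten one level
flat = list(chain.from_iterable(words))     # ['spark', 'is', 'fast', 'spark', 'scales']

# filter: keep elements matching a predicate
no_is = [w for w in flat if w != "is"]

# reduceByKey: aggregate values sharing a key (a word count here)
counts = {}
for w in flat:
    counts[w] = counts.get(w, 0) + 1        # {'spark': 2, ...}

# reduce (an action): collapse everything to a single value
total = reduce(lambda a, b: a + b, lengths) # 5
print(counts["spark"], total)
```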
So first of all, let me show you the bashrc file which I was talking about. So here you can see, in the .bashrc file we provide the path for all the frameworks which we have installed in the system. So, for example, you can see here we have installed Hadoop; the moment we install and unzip, or rather say untar it, I have shifted all my frameworks to one particular location. As you can see, it is usr, the user directory, and inside this we have the library, and inside that I have installed Hadoop and also Spark. Now, as you can see here, we have two lines; I'll highlight this one for you: the PYSPARK_DRIVER_PYTHON, which is Jupyter, and we have given its options as notebook. What it'll do is, the moment I start PySpark, it will automatically redirect me to the Jupyter Notebook.
So let me just rename this notebook as rdd tutorial. So let's get started. So here, to load any file into an RDD, suppose I'm loading a text file, you need to use sc, that is, the Spark context: sc.textFile, and you need to provide the path of the data which you are going to load. So one thing to keep in mind is that the default path which the RDD, or the Jupyter Notebook, takes is the HDFS path. So in order to use the local file system, you need to mention file: and two forward slashes. So once our sample data is inside the RDD, to have a look at it we need to invoke an action. So let's go ahead and take a look at the first five objects, or rather say the first five elements, of this particular RDD. The sample data I have taken here is about blockchain, as you can see.
We have one, two, three, four and five elements here. Suppose I need to convert all the data into lowercase and split it word by word. So for that, I will create a function, and in the function I'll pass on this RDD. So I'm creating, as you can see here, rdd1, that is, a new RDD, and using the map function, or rather say the transformation, and passing on the function which I just created, to lower and to split it. So if we have a look at the output of rdd1, as you can see here, all the words are in lowercase and all of them are separated with the help of a space. Now there is another transformation, which is known as flatMap, to give you a flattened output, and I am passing the same function which I created earlier. So let's go ahead and have a look at the output for this one. So as you can see here, we got the first five elements, which are the same ones as we got here: the contracts, transactions, and, the records.
So just one thing to keep in mind: flatMap is a transformation, whereas take is an action. Now, as you can see, the contents of the sample data contain stop words. So if I want to remove all the stop words, all you need to do is create a list of stop words, in which I have mentioned, as you can see, "a", "all", "the", "as", "is" and "and". Now, these are not all the stop words; I've chosen only a few of them just to show you what exactly the output will be. And now we are using the filter transformation here, with the help of a lambda function in which we have specified x such that x is not in the stop words list, and we have created another RDD, which is rdd3, which will take its input from rdd2. So let's go ahead and see whether "and" and "the" are removed or not. As you can see: "contracts", "transactions", "records" and so on.
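The stop-word filter above can be reproduced in plain Python like this (the word list is a toy stand-in for the RDD contents):

```python
# Keep only the words that are not in the (deliberately incomplete) stop-word list.
words = ["contracts", "and", "transactions", "the", "records", "is"]
stop_words = ["a", "all", "the", "as", "is", "and"]

filtered = [x for x in words if x not in stop_words]
print(filtered)  # ['contracts', 'transactions', 'records']
```

In PySpark the same step is `rdd3 = rdd2.filter(lambda x: x not in stop_words)`.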
If you look at the output, we have "contracts" and "transactions", and "and" and "the" are not in this list. But suppose I want to group the data according to the first three characters of any element. For that I'll use groupBy, and I'll use the lambda function again. So let's have a look at the output: you can see we have "edg" and "edges", so the first three letters of both words are the same. Similarly, we can group using the first two letters also; let me just change it to two, and you can see we have "gu" and "guide". Now, these are the basic transformations and actions, but suppose I want to find out the sum of the first ten thousand numbers. All I need to do is initialize another RDD, which is num_rdd, and we use sc.parallelize, and the range we have given is 1 to 10,000, and we'll use the reduce action here to see the output.
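Both steps — grouping by a prefix and reducing to a sum — can be sketched with the standard library (the word list is hypothetical, chosen to match the "gu"/"ed" groups mentioned above):

```python
from itertools import groupby
from functools import reduce

# Group words by their first two characters, mimicking rdd.groupBy(lambda w: w[:2]).
words = sorted(["guide", "guard", "edge", "edges"])  # itertools.groupby needs sorted input
grouped = {key: list(grp) for key, grp in groupby(words, key=lambda w: w[:2])}

# Sum of 1..10000 with reduce, mimicking
# sc.parallelize(range(1, 10001)).reduce(lambda a, b: a + b).
total = reduce(lambda a, b: a + b, range(1, 10001))

print(grouped)  # {'ed': ['edge', 'edges'], 'gu': ['guard', 'guide']}
print(total)    # 50005000
```

Note that Spark's groupBy shuffles matching keys together itself, so no pre-sorting is needed there; the sort here is only an artifact of `itertools.groupby`.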
We have the sum of the numbers ranging from one to ten thousand. Now, this was all about RDDs. The next topic that we have on our list is broadcast variables and accumulators. Now, in Spark we perform parallel processing with the help of shared variables: when the driver sends any task to the executors present on the cluster, a copy of the shared variable is also sent to each node of the cluster, thus maintaining high availability and fault tolerance. Now, this is done in order to accomplish the task, and Apache Spark supports two types of shared variables. One of them is the broadcast variable, and the other one is the accumulator. Now, broadcast variables are used to save a copy of the data on all the nodes in a cluster, whereas the accumulator is a variable that is used for aggregating the incoming information via different associative and commutative operations. Now, moving on to our next topic, which is Spark configuration: the SparkConf class provides a set of configurations and parameters that are needed to execute a Spark application on the local system or on any cluster. Now, when you use a SparkConf object to set the values of these parameters, they automatically take priority over the system properties. Now, this class contains various getter and setter methods, some of which are: the set method, which is used to set a configuration property.
We have setMaster, which is used for setting the master URL; there is setAppName, which is used to set an application name; and we have the get method to retrieve the configuration value of a key. And finally we have setSparkHome, which is used for setting the Spark installation path on worker nodes. Now, coming to the next topic on our list, which is Spark files: the SparkFiles class contains only class methods, so the user cannot create any SparkFiles instance. Now, this helps in resolving the paths of the files that are added using the SparkContext addFile method. The SparkFiles class contains two class methods, which are the get method and the getRootDirectory method. Now, get is used to retrieve the absolute path of a file added through SparkContext.addFile, and getRootDirectory is used to retrieve the root directory that contains the files that were added through SparkContext.addFile. Now, these were small topics, and the next topic that we will be covering on our list is data frames. Now, data frames in Apache Spark are a distributed collection of rows under named columns, which is similar to relational database tables or Excel sheets.
It also shares common attributes with the RDDs. A few characteristics of data frames: they are immutable in nature, that is, you can create a data frame but you cannot change it; they allow lazy evaluation, that is, a task is not executed unless and until an action is triggered; and moreover, data frames are distributed in nature, designed for processing large collections of structured or semi-structured data. They can be created using different data formats, like loading the data from source files such as JSON or CSV, or you can load them from an existing RDD. You can use databases like Hive or Cassandra, you can use Parquet files, you can use CSV or XML files; there are many sources through which you can create a particular data frame. Now, let me show you how to create a data frame in PySpark and perform various actions and transformations on it.
So let's continue this in the same notebook which we have here. Now, here we have taken the NYC flights data, and I'm creating a data frame which is nyc_flights_df. Now, to load the data we are using the spark.read.csv method, and you need to provide the path, which is the local path; by default it takes HDFS, same as with the RDD. And one thing to note down here is that I've provided two extra parameters, which are inferSchema and header. If we do not provide these as true, or we skip them, what will happen is that if your data set has the names of the columns in the first row, it will take those as data as well, and it will not infer the schema. Now, once we have loaded the data in our data frame, we need to use the show action to have a look at the output. So as you can see here, we have the output: it gives us the top 20 rows of the particular data set. We have the year, month, day, departure time, departure delay, arrival time, arrival delay and so many more attributes.
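What header and inferSchema do can be illustrated with the standard csv module on a tiny hypothetical CSV (these rows are invented for illustration, not taken from the real NYC data set):

```python
import csv, io

# A tiny stand-in for the flights CSV.
raw = "year,month,distance\n2013,1,1400\n2013,1,1416\n"
rows = list(csv.reader(io.StringIO(raw)))

# Without header handling, the column names become just another data row of strings:
no_header = rows

# With header=True semantics the first row names the columns, and
# inferSchema=True corresponds to casting the remaining values to integers:
header, data = rows[0], [[int(v) for v in r] for r in rows[1:]]

print(header)  # ['year', 'month', 'distance']
print(data)    # [[2013, 1, 1400], [2013, 1, 1416]]
```

This is why skipping those two options leaves you with a schema of all-string columns plus a bogus first row.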
To print the schema of the particular data frame, you need the printSchema method. So let's have a look at the schema. As you can see here, we have year, which is integer, and month, integer; almost half of them are integers. We have the carrier as string, the tail number as string, the origin as string, the destination as string, and so on. Now, suppose I want to know how many records there are in my database, or rather say the data frame; you need the count function for this one, and I'll show you the result. So as you can see here, we have 3.3 million records — three million thirty-six thousand seven hundred seventy-six, to be exact. Now, suppose I want to have a look at the flight name, the origin and the destination, just these three columns from the particular data frame; we need to use the select option. So as you can see here, we have the top 20 rows.
Now, what we saw was the select query on this particular data frame. But if I want to see, or rather check, the summary of any particular column — suppose I want to check what the lowest or the highest value is in the distance column — I need to use the describe function here. So I'll show you what the summary looks like. For the distance: the count is the total number of rows, and we have the mean, the standard deviation, the minimum value, which is 17, and the maximum value, which is 4983. Now, this gives you a summary of the particular column. So now that we know that the minimum distance is 17, let's go ahead and filter our data using the filter function, in which the distance is 17. So you can see here we have one record, from the year 2013, in which the distance is 17. Similarly, suppose I want to have a look at the flights which are originating from EWR; we use the filter function here as well.
Now, there is another clause here, the where clause, which is also used for filtering. Suppose I want to have a look at the flight data and filter it to see the flights where the day on which the flight took off was the second of any month. So here, instead of filter, we can also use a where clause, which will give us the same output. Now, we can also pass multiple parameters, or rather say multiple conditions. So suppose I want the day of the flight to be the seventh, the origin to be JFK, and the arrival delay to be less than 0 — that is, none of the postponed flights. So just to have a look at these numbers, we'll use the where clause and combine all the conditions using the & symbol. So you can see here, in all the data the day is 7, the origin is JFK and the arrival delay is less than 0.
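The multi-condition filter can be mimicked in plain Python (the three records below are hypothetical stand-ins for flight rows):

```python
# Plain-Python analog of chaining conditions with & in DataFrame.where(...).
flights = [
    {"day": 7, "origin": "JFK", "arr_delay": -5},
    {"day": 7, "origin": "EWR", "arr_delay": -2},
    {"day": 2, "origin": "JFK", "arr_delay": 10},
]

matches = [f for f in flights
           if f["day"] == 7 and f["origin"] == "JFK" and f["arr_delay"] < 0]
print(matches)  # only the first record satisfies all three conditions
```

In PySpark the same filter reads `df.where((df.day == 7) & (df.origin == 'JFK') & (df.arr_delay < 0))` — note that each condition must be parenthesized, because `&` binds tighter than the comparison operators.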
These were the basic transformations and actions on the particular data frame. Now, one thing we can also do is create a temporary table for SQL queries: if someone is not used to all these transformations and actions and would rather use SQL queries on the data, they can use registerTempTable to create a table from their particular data frame. What we'll do is convert the nyc_flights_df data frame into an nyc_flights table, which can be used later, and SQL queries can be performed on this particular table. So, remember, in the beginning we used nyc_flights_df.show; now we can use SELECT * FROM nyc_flights to get the same output. Now, suppose we want to look at the minimum air time of any flight; we use SELECT MIN(air_time) FROM nyc_flights — that is the SQL query. We pass all the SQL queries into the SQL context's sql function. So you can see here, we have the minimum air time as 20. Now, let's have a look at the records in which the air time is the minimum, 20.
Now, we can also use nested SQL queries. Suppose I want to check which flights have the minimum air time of 20; that cannot be done in a simple SQL query, we need a nested query for that. So: SELECT * FROM nyc_flights WHERE air_time IN, and inside that we have another query, which is SELECT MIN(air_time) FROM nyc_flights. Let's see if this works or not. Yes, as you can see here, we have two flights which have the minimum air time of 20. So guys, this is it for data frames. So let's get back to our presentation and have a look at the list which we were following. We completed data frames; next we have storage levels. Now, StorageLevel in PySpark is a class which helps in deciding how the RDDs should be stored: based on this, RDDs are either stored on disk or in memory, or in both. The StorageLevel class also decides whether the RDDs should be serialized and whether to replicate their partitions. The final and last topic for today's list is MLlib. MLlib is the machine learning API which is provided by Spark, which is also present in Python, and this library is heavily used in Python for machine learning as well as real-time streaming analytics. The various algorithms supported by this library: first of all, we have spark.mllib collaborative filtering. Now, spark.mllib supports model-based collaborative filtering, in which users and products are described by a small set of latent factors, which we can use to predict the missing entries.
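The same nested-query pattern can be tried against SQLite from the standard library (the flight names and air times below are invented to reproduce the "two flights at the minimum" situation):

```python
import sqlite3

# An in-memory table standing in for the registered nyc_flights temp table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nyc_flights (flight TEXT, air_time INTEGER)")
con.executemany("INSERT INTO nyc_flights VALUES (?, ?)",
                [("A1", 20), ("B2", 45), ("C3", 20), ("D4", 90)])

min_air = con.execute("SELECT MIN(air_time) FROM nyc_flights").fetchone()[0]

# Nested query: all flights whose air_time equals the minimum.
shortest = con.execute(
    "SELECT flight FROM nyc_flights "
    "WHERE air_time IN (SELECT MIN(air_time) FROM nyc_flights)").fetchall()

print(min_air)   # 20
print(shortest)  # two matching flights
```

In the notebook the identical SQL string is simply passed to `sqlContext.sql(...)` instead.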
However, to learn these latent factors, spark.mllib uses alternating least squares, which is the ALS algorithm. Next we have MLlib clustering: clustering is an unsupervised learning problem, where we try to group subsets of entities with one another on the basis of some notion of similarity. Next, we have frequent pattern mining, which is FPM: frequent pattern mining is mining frequent items, itemsets, subsequences or other substructures, and these are usually among the first steps in analyzing a large-scale data set. This has been an active research topic in data mining for years. We have linear algebra: this module supports PySpark MLlib utilities for linear algebra. We have collaborative filtering. We have classification: for binary classification, various methods are available in the spark.mllib package, as well as multiclass classification and regression analysis. In classification, some of the most popular algorithms used are naive Bayes, random forest, decision tree and so on. And finally we have linear regression: basically, linear regression comes from the family of regression algorithms; finding relationships and dependencies between variables is the main goal of regression. The PySpark MLlib package also covers other algorithm classes and functions.
Let's now try to implement all the concepts which we have learned in this PySpark tutorial session. Now, here we are going to use a heart disease prediction model, and we are going to predict using a decision tree, with the help of classification as well as regression. Now, these are all part of the MLlib library. Let's see how we can perform these types of functions and queries. First of all, what we need to do is initialize the SparkContext. Next, we are going to read the UCI data set for heart disease prediction, and we are going to clean the data. So let's import the pandas and the NumPy libraries here. Let's create a data frame as heart_disease_df, and as mentioned earlier, we are going to use the read_csv method here; and here we don't have a header, so we have provided header as None. Now, the original data set contains 303 rows and 14 columns. Now, for the categories of diagnosis of heart disease that we are predicting, the value 0 is for less than 50% diameter narrowing, and the value 1 which we are giving is for more than 50% diameter narrowing.
So here we are using the NumPy library. These are fairly old methods, which is why it is showing the deprecation warning, but no issues, it will work fine. So as you can see here, we have the categories of diagnosis of heart disease that we are predicting: the value 0 is for less than 50% and the value 1 is for greater than 50%. So what we did here was clean the rows which have a question mark or which have empty spaces. Now let's have a look at the data set. You can see here we have zero in many places instead of the question mark which we had earlier, and now we are saving it to a txt file. And you can see here, after dropping the rows with any empty values, we have 297 rows and 14 columns. So this is what the new cleaned data set looks like. Now we are importing the MLlib library and the regression module here. Now, what we are going to do is create a LabeledPoint, which is a local vector associated with a label or a response.
So for that we need to import mllib.regression. We are taking the text file which we just created, now without the missing values. Now, next, what we are going to do is parse the MLlib data line by line into the MLlib LabeledPoint object, and we are going to convert the -1 labels to 0. Now, let's have a look after parsing the lines. Okay, we have the labels 0.0 and 1.0; that's cool. Now, next, what we are going to do is perform classification using the decision tree. For that we need to import pyspark.mllib.tree. So next, what we have to do is split the data into training and testing data, and we split the data here into the 70-to-30 standard ratio, 70% being the training data set and 30% being the testing data set. Next, what we do is train the model
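The 70/30 split can be sketched in plain Python (the records here are just index stand-ins; in the notebook this is done with the RDD's randomSplit method):

```python
import random

# A reproducible 70/30 train-test split, mirroring data.randomSplit([0.7, 0.3]).
records = list(range(100))
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(records)

cut = int(len(records) * 0.7)
training, testing = records[:cut], records[cut:]
print(len(training), len(testing))  # 70 30
```

Note that Spark's `randomSplit([0.7, 0.3])` only approximates the 70/30 proportions per run, whereas this sketch produces exact counts.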
which we have created here, using the training set. We have created a trained model with DecisionTree's trainClassifier: we have used the training data; the number of classes, which is 5; the categorical features which we have given; and the maximum depth to which we are classifying, which is 3. Next, what we are going to do is evaluate the model based on the test data set and evaluate the error. So here we are creating predictions, and we are using the test data to get predictions through the model which we created, and we are also going to find the test error here. So as you can see here, the test error is 0.2297, and we have printed the learned classification decision tree model, with the feature split at each node. So as you can see, our model is pretty good. So now, next, we'll use regression for the same purpose. So let's perform regression using a decision tree. As you can see, we have the trained model, and we are using the decision tree to
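The test error quoted above is simply the fraction of test records whose predicted label differs from the true label. A minimal sketch with toy labels (not the actual model outputs):

```python
# Classification test error: misclassified count divided by total count.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

test_err = sum(1 for a, p in zip(actual, predicted) if a != p) / len(actual)
print(test_err)  # 2 mismatches out of 8 -> 0.25
```

In the MLlib workflow the same quantity is computed over the `(label, prediction)` pairs of the test RDD.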
train the regressor, using the same training data which we used for the decision tree model above; there we used classification, now we are using regression. Similarly, we are going to evaluate our model using our test data set and find the test error, which for regression is the mean squared error. So let's have a look at the mean squared error: the mean squared error is 0.168, and that is good. Now, finally, if we have a look at the learned regression tree model, you can see we have created the regression tree model to a depth of 3 with 15 nodes, and here we have all the features and the structure of the tree. Hello folks, welcome to Spark interview questions. This session has been planned to cover commonly asked interview questions related to the Spark technology and their general answers, and the expectation is that you are already aware of this particular technology
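The mean squared error used for the regression tree is the average of the squared differences between the actual and predicted values. A toy sketch (the numbers are illustrative, not the model's outputs):

```python
# Mean squared error: average of squared residuals.
actual    = [1.0, 0.0, 1.0, 0.0]
predicted = [0.8, 0.2, 0.6, 0.0]

mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
print(round(mse, 3))  # (0.04 + 0.04 + 0.16 + 0.0) / 4 = 0.06
```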
to some extent. In general, we will go through the commonly asked questions, and along the way I will also give an introduction to the technology. So let's get this started. The agenda for this particular session: we are going to cover the basic questions and the questions related to the Spark core technology. When I say Spark core, that's going to be the base, and on top of Spark core we have four important components: Streaming, GraphX, MLlib and SQL. All these components have been created to satisfy specific needs. We will get an introduction to these technologies and get into the commonly asked interview questions, and the questions are also framed in such a way that they cover the spectrum of common doubts as well as the features available within each specific technology. So let's take the first question and look into the answer, and how commonly it is covered: What is Apache Spark? Spark is with the Apache Foundation; it's open source.
It's a cluster computing framework for real-time processing. So, three main keywords over here: Apache Spark is an open source project; it's used for cluster computing; and for in-memory processing, along with real-time processing, it's going to support in-memory computing. So there are lots of projects which support cluster computing; along with that, Spark differentiates itself by doing in-memory computing. It has a very active community, and out of the Hadoop ecosystem technologies, Apache Spark is very active: multiple releases we got last year; it's a very active project among the Apache projects. Basically, it's a framework which supports in-memory computing and cluster computing, and you may face this specific question: how is Spark different from MapReduce, or how can you compare it with MapReduce? MapReduce is the processing methodology within the Hadoop ecosystem, and within the Hadoop ecosystem we have HDFS, the Hadoop distributed file system; MapReduce is going to support distributed computing. So how is Spark different?
So how can we compare Spark with MapReduce? In a way, this comparison is going to help us understand the technology better, but definitely we cannot compare these two directly; they are two different methodologies by which they work. Spark is very simple to program, but in MapReduce there is no abstraction, in the sense that all the implementations we have to provide ourselves. Interactivity: Spark has an interactive mode to work with; in MapReduce there is no interactive mode. There are some components, like Apache Pig and Hive, which facilitate interactive computing or interactive programming. And Spark supports real-time stream processing; to say it precisely, within Spark the stream processing is called near-real-time processing. There's nothing in the world that is true real-time processing.
It's near-real-time processing: it's going to do the processing in micro-batches. I'll cover that in detail when we are moving on to the streaming concept. And in MapReduce you're going to do batch processing on historical data. When I say stream processing, I will get the data that is generated in real time, do the processing, and get the result, and either store it or publish it to a subscribing community. Latency-wise, MapReduce will have very high latency, because it has to read the data from hard disk, but Spark will have very low latency, because it can reprocess or reuse the data already cached in memory. But there is a small catch over here: in Spark, the first time the data gets loaded it has to read it from the hard disk, same as MapReduce. So once it is read, it will be there in the memory. So Spark is good whenever we need to do iterative computing: whenever you do iterative computing — again and again doing the processing on the same data, especially in machine learning and deep learning, where we will be using iterative computing — Spark performs much better. You will see a huge performance improvement, a hundred times faster than MapReduce.
But if it is one-time, fire-and-forget type processing, with Spark you may get the same latency you would get with MapReduce — maybe some improvement because of the building block of Spark, the DAG; that's where you may get some additional advantage. So that's the key comparison factor of Spark and MapReduce. Now, let's get on to the key features of Spark. We discussed speed and performance: it's going to use in-memory computing, so speed- and performance-wise it's going to be much better when we do iterative computing. Polyglot, in the sense of the programming languages to be used with Spark: it can be any of these languages — Python, Java, R or Scala; we can do programming with any of these languages. And data formats to give as input: we can give any data formats, like JSON or Parquet, as input. And the key selling point with Spark is its lazy evaluation, in the sense that it's going to calculate the DAG, the directed acyclic graph: it's going to calculate what steps need to be executed to achieve the final result.
So we need to give all the steps, as well as what final result we want; it's going to calculate the optimal execution plan — which steps need to be calculated, or which steps need to be executed — and only those steps will it execute. So basically it's lazy execution: only if the result needs to be produced will it do the processing. And about real-time computing: that's through Spark Streaming; there is a component called Spark Streaming which supports real-time computing. And it gels with the Hadoop ecosystem very well: it can run on top of Hadoop YARN, or it can leverage HDFS to do the processing. So when it leverages HDFS, the Hadoop cluster can be used to do the distributed computing, and it can leverage the resource manager to manage the resources. So Spark can gel with HDFS very well, as well as leverage the resource manager to share the resources, as well as data locality: it can exploit data locality.
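The lazy-evaluation idea described above can be seen in miniature with Python generators (a plain-Python analogy, not Spark itself): transformations only record the steps, and nothing runs until an action demands a result.

```python
# Lazy evaluation in miniature: the pipeline is recorded, not executed.
executed = []

def load():
    for n in [1, 2, 3, 4]:
        executed.append(n)      # side effect so we can observe execution
        yield n

doubled = (n * 2 for n in load())   # "transformation": nothing runs yet
assert executed == []               # no work has happened so far

result = sum(doubled)               # "action": now the whole pipeline executes
print(result, executed)             # 20 [1, 2, 3, 4]
```

Spark does the same at cluster scale: `map`, `filter` and friends only extend the DAG, and a single action like `collect` or `reduce` triggers the optimized execution.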
It can do the processing local to where the data is located within HDFS. And it has a fleet of machine learning algorithms already implemented, right from clustering to classification to regression; all this logic is already implemented. Machine learning is achieved using MLlib within Spark, and there is a component called GraphX which supports graph computing: we can solve problems using graph theory, using the GraphX component within Spark. So these are the things we can consider as the key features of Spark. So, when you discuss the installation of Spark, you may come across this question on YARN: what is YARN, and do you need to install Spark on all nodes of a YARN cluster? So YARN is nothing but Yet Another Resource Negotiator; that's the resource manager within the Hadoop ecosystem. So that's going to provide the resource management platform: YARN is going to provide the resource management platform across all the clusters, and Spark is going to provide the data processing. So wherever resources are being allocated, at that location Spark will be used to do the data processing. And of course, yes, we need to have Spark installed on all the nodes where Spark executors are located.
That's basically because we need those libraries. In addition to the installation of Spark on all the worker nodes, we need to increase the RAM capacity on the worker machines as well, as Spark is going to consume a huge amount of memory to do the processing; it will not do the MapReduce way of working — internally it's going to generate the DAG and do the processing on top of YARN. So YARN, at a high level, is like a resource manager, or like an operating system for distributed computing: it's going to coordinate all the resource management across the fleet of servers, and on top of it I can have multiple components, like Spark, Tez or Giraph; Spark especially is going to help us to achieve in-memory computing. So Spark-on-YARN is nothing but a resource manager managing the resources across the cluster with Spark on top of it, and yes, we need to have Spark installed on all the nodes where the Spark YARN cluster is used, and also, additional to that,
we need to have the memory increased on all the worker nodes. The next question goes like this: what file systems does Spark support? What is a file system? When we work on an individual system, we will have a file system to work within that particular operating system. In a distributed cluster, or in a distributed architecture, we need a file system with which we can store the data in a distributed mechanism. Hadoop comes with the file system called HDFS, the Hadoop distributed file system, where the data gets distributed across multiple systems, and it will be coordinated by two different types of components called the name node and the data node. And Spark can use this HDFS directly: you can have any files in HDFS and start using them within the Spark ecosystem, and it gives another advantage of data locality: when it does the distributed processing, wherever the data is distributed,
the processing can be done locally on that particular machine where the data is located. And to start with, in standalone mode, you can use the local file system as well; this can be used especially when we are doing development, or for any POC you can use the local file system. And Amazon Cloud provides another file system called S3, the Simple Storage Service; we call it S3. It's an object storage service; this can also be leveraged or used within Spark for the storage. And it supports a lot of other file systems as well: there are some file systems, like Alluxio, which provide in-memory storage, so we can leverage that particular file system as well. So we have seen all the features and the functionalities available within Spark; now we're going to look at the limitations of using Spark. Of course, every component, when it comes with huge power and advantages, will have its own limitations as well. This question illustrates some limitations of using Spark: Spark utilizes more storage space compared to Hadoop, and when it comes to the installation,
it's going to consume more space. But in the Big Data world, that's not a very huge constraint, because the cost of storage is not very high in the big data space. And the developer needs to be careful while running apps on Spark, the reason being that it uses in-memory computing. Of course, it handles the memory very well, but if you try to load a huge amount of data in the distributed environment and you try to do a join — when you try to do a join within the distributed world, the data is going to get transferred over the network, and the network is really a costly resource. So the plan or design should be such that it reduces or minimizes the data transferred over the network, and in whatever way possible, with all possible means, we should facilitate the distribution of data over multiple machines: the more we distribute, the more parallelism we can achieve, and the faster we can get the results. And cost efficiency:
if you try to compare the cost — how much cost is involved to do a particular piece of processing, take any unit, say processing 1 GB of data with iterative processing — cost-wise, in-memory computing is always considered costlier, because memory is relatively costlier than storage, so that may act like a bottleneck, and we cannot increase the memory capacity of a machine beyond some limit. So we have to grow horizontally. And when we have the data distributed in memory across the cluster, of course the network transfer and all those bottlenecks will come into the picture. So we have to strike the right balance, which will help us to achieve the in-memory computing, whatever in-memory computing we require. And it consumes a large amount of resources for processing compared to Hadoop; Spark performs better when it is used for iterative computing, because, as with both Spark and the other technologies,
it has to read data for the first time from the hard disk or from another data source, and Spark's performance is really better when it reads the data and does the processing while the data is available in the cache. Of course, the DAG is going to give us a lot of advantage while doing the processing, but the in-memory computing and processing, that's going to give us lots of leverage. The next question: list some use cases where Spark outperforms Hadoop in processing. The first thing is real-time processing: Hadoop cannot handle real-time processing, but Spark can handle real-time processing. So, for any data that's coming in — in the Lambda architecture you will have three layers, and most of the big data projects will be in the Lambda architecture:
you will have the speed layer, the batch layer and the serving layer, and in the speed layer, whenever the data comes in, it needs to be processed, stored and handled. So in those types of real-time processing, Spark is the best fit. Of course, in the Hadoop ecosystem we have other components which do real-time processing, like Storm, but when you want to leverage machine learning along with Spark Streaming in such computations, Spark will be much better. So that's why, when you have an architecture like the Lambda architecture and you want to have all three layers — batch layer, speed layer and serving layer — Spark can gel with the speed layer and serving layer far better, and it's going to provide better performance. And whenever you do big data processing, especially machine learning processing, we will leverage iterative computing, and it can perform a hundred times faster than Hadoop: the more iterative processing we do, the more data will be read from the memory, and it's going to give us much faster performance than we would get with MapReduce.
So again, remember: whenever you do the processing only once — read, process and deliver the result — Spark may not be the best fit; that can be done with MapReduce itself. And there is another component called Akka; it's a messaging system, or a message coordination system. Spark internally uses Akka for scheduling: any task that needs to be assigned by the master to the worker, and the follow-up of that particular task by the master — basically an asynchronous coordination system, and that's achieved using Akka. Akka programming is used internally by Spark; as developers, we don't need to worry about Akka programming. Of course we can leverage it, but Akka is used internally by Spark for scheduling and coordination between the master and the worker. And within Spark we have a few major components. Let's see: what are the major components of the Apache Spark ecosystem? At a high level, among the components of the Spark ecosystem, Spark comes with a core engine, so that has the core
So that has the core functionalities of what is required by Spark; all these core functionalities are the building blocks of the Spark core engine. The basic functionalities, file interaction, file system coordination, all of that is done by the Spark core engine. On top of the Spark core engine we have a number of other offerings: to do machine learning, to do graph computing, to do streaming, we have a number of other components. The majorly used of these components, like Spark SQL, Spark Streaming, MLlib, GraphX and SparkR, are the high-level ones. We will see what these components are. Spark SQL especially is designed to do processing against structured data, so we can write SQL queries and do the processing. It's going to give us the interface to interact with the data, especially structured data, and the language that we use is very similar to what we use within SQL; I can say 99 percent of the most commonly used functionalities within SQL have been implemented within Spark SQL. And Spark Streaming is going to support stream processing.
That's the offering available to handle stream processing. And MLlib is the offering to handle machine learning; the component is called MLlib and it has a list of machine learning algorithms already defined, so we can leverage and use any of those machine learning algorithms. GraphX, again, is the graph processing offering within Spark. It's going to support us to achieve graph computing against the data that we have, like PageRank calculation, how many connected entities, how many triangles; all of those are going to give a meaning to that particular data. And SparkR is the component that is going to help us to leverage the R language within the Spark environment. R is a statistical programming language where we can do statistical computing, and we can leverage the R language by using SparkR to get that executed within the Spark environment. In addition to that, there are other components as well, like the approximate query engine called BlinkDB; all those other things are there as well.
So these are the majorly used components within Spark. So next question: how can Spark be used alongside Hadoop? So even though Spark's performance is much better, it's not a replacement for Hadoop; it's going to coexist with Hadoop, and leveraging Spark and Hadoop together is going to help us achieve the best result. Yes, Spark can do in-memory computing and can handle the speed layer, and Hadoop comes with the resource manager, so we can leverage the resource manager of Hadoop to make Spark work. And for some processing we don't need to leverage in-memory computing; for example, one-time processing, where we do the processing once, store the result and forget it, we can use MapReduce, so the computing cost will be much less compared to Spark. So we can amalgamate them and strike the right balance between batch processing and stream processing when we have Spark along with Hadoop. Let's have some detailed questions related to Spark core. Within Spark core, as I mentioned earlier, the core building block of Spark is the RDD, the resilient distributed dataset.
It's a virtual entity; it's not a physical entity, it's a logical entity. You will not see this RDD existing; the existence of the RDD will come into the picture only when you take some action. So the RDD will be used to create the DAG, and RDDs will be optimized as they transform from one form to another form, to make a plan for how the dataset needs to be transformed from one structure to another structure. And finally, when you take some action against an RDD, the existence of the resultant data will come into the picture, and that can be stored in any file system, whether it's HDFS, S3 or any other file system. And the RDD can exist in a partitioned form, in the sense that it can get distributed across multiple systems, and it's fault tolerant: if any partition of the RDD is lost, it can regenerate only that specific partition. So that's a huge advantage of the RDD.
So, first, the huge advantage of the RDD is that it's fault tolerant, where it can regenerate the lost RDDs. And it can exist in a distributed fashion, and it is immutable, in the sense that once the RDD is defined it cannot be changed. The next question is, how do we create RDDs in Spark? There are two ways we can create RDDs. One: using the Spark context we can take any of the collections available within Scala or Java and, using the parallelize function, we can create the RDD, and it's going to use the underlying file system's distribution mechanism; if the data is located in a distributed file system like HDFS, it will leverage that and make those RDDs available on a number of systems, following the same distribution. Or we can create the RDD by loading the data from external sources as well; HDFS we may not consider as an external source, as it will be considered the file system of Hadoop.
So when Spark is working with Hadoop, mostly the file system we will be using will be HDFS; we can read from HDFS, or we can read from other sources, like a Parquet file, S3 or other sources, and create the RDD. Next question: what is executor memory in a Spark application? Every Spark application will have a fixed heap size and a fixed number of cores for the Spark executor. An executor is nothing but the execution unit available in every machine, and it's going to facilitate doing the processing, doing the tasks, in the worker machine. So irrespective of whether you use the YARN resource manager or any other resource manager like Mesos, every worker machine will have an executor, and within the executor the tasks will be handled. The memory to be allocated for that particular executor is what we define as the heap size, and we can define how much memory should be used by that particular executor within the worker machine, as well as the number of cores that can be used by the executor for that Spark application, and that can be controlled through the configuration files of Spark.
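As a sketch of how these settings are typically supplied, the flags below are standard `spark-submit` options; the application name and the numeric values here are only illustrative, not recommendations:

```shell
# --executor-memory is the executor heap size,
# --executor-cores the cores per executor,
# --num-executors the executor count (YARN mode).
spark-submit \
  --master yarn \
  --executor-memory 4g \
  --executor-cores 2 \
  --num-executors 10 \
  my_app.py
```

The same settings can also be given as configuration properties, such as `spark.executor.memory`, `spark.executor.cores` and `spark.executor.instances`.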
Next question: what do you understand by partitions in Apache Spark? So for any data, irrespective of whether it is small data or large data, we can divide the dataset across multiple systems; the process of dividing the data into multiple pieces and making it store across multiple systems as different logical units is called partitioning. So in simple terms, partitioning is nothing but the process of dividing the data and storing it in multiple systems. And by default the conversion of the data into an RDD will happen in the system where the partition exists. So the more partitions, the more parallelism we are going to get; at the same time, we have to be careful not to trigger a huge amount of network data transfer as well. Every RDD can be partitioned within Spark, and the partitioning is going to help us achieve parallelism: the more partitions we have, the more parallel executions can be done. And the key thing about the success of a Spark program is minimizing the network traffic while doing the parallel processing, minimizing the data transfer between the systems of Spark.
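As a rough illustration of the idea in plain Python (this is a conceptual sketch, not Spark code; the function name is made up), hash partitioning spreads records across a fixed number of buckets so each bucket can live on, and be processed by, a different machine:

```python
def hash_partition(records, num_partitions):
    """Spread records across num_partitions buckets by hash (illustrative)."""
    buckets = [[] for _ in range(num_partitions)]
    for r in records:
        buckets[hash(r) % num_partitions].append(r)
    return buckets

# 10 records spread over 3 "machines"; each bucket can be processed in parallel
parts = hash_partition(range(10), 3)
```

More partitions mean more parallelism, but also more data movement across the network when records have to shuffle between them, which is the trade-off the speaker is pointing at.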
What operations does an RDD support? So we can apply multiple operations against an RDD, and we can group them into two types. One is transformations: in transformations, the RDD will get transformed from one form to another form, like filtering or grouping; one small example, reduceByKey or filter, all of those are transformations. The resultant of a transformation will be another RDD. At the same time, we can take some actions against the RDD, and that's going to give us the final result: I can say count how many records there are, or store the result into HDFS; those all are actions. So multiple actions can be taken against the RDD, and the existence of the data will come into the picture only if I take some action against the RDD.
Okay, next question: what do you understand by transformations in Spark? So transformations are nothing but functions, mostly higher-order functions within Scala, which will be applied against the RDD, mostly against the list of elements that we have within the RDD. That function will get applied, but the existence of the RDD will come into the picture only if we take some action against it. In this particular example, I am reading the file and having it within the RDD called rawData, then I am doing some transformation using a map. So it's going to apply a function: within the map I have some function which will split each record using the tab, so the split with the tab will be applied against each record within rawData, and the resultant moviesData will again be another RDD, but of course this will be a lazy operation.
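The lazy behaviour just described can be mimicked in a few lines of plain Python (a conceptual sketch, not Spark's actual implementation): transformations like `map` only record what to do, and only an action such as `count` or `collect` actually runs the pipeline:

```python
class LazyRDD:
    """Minimal sketch of a lazily evaluated RDD: transformations are
    recorded, nothing runs until an action is called."""
    def __init__(self, data, transforms=None):
        self._data = data
        self._transforms = transforms or []

    def map(self, fn):
        # Transformation: returns a new "RDD", computes nothing yet.
        return LazyRDD(self._data, self._transforms + [fn])

    def collect(self):
        # Action: now the recorded pipeline actually executes.
        out = list(self._data)
        for fn in self._transforms:
            out = [fn(x) for x in out]
        return out

    def count(self):
        return len(self.collect())

raw_data = LazyRDD(["1\tToy Story", "2\tJumanji"])
movies_data = raw_data.map(lambda line: line.split("\t"))  # nothing computed yet
```

Only `movies_data.count()` or `movies_data.collect()` would trigger the split, which mirrors the speaker's tab-split example.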
The existence of moviesData will come into the picture only if I take some action against it, like count or print or store; only those actions will generate the data. So next question: define the functions of Spark core. So that's going to take care of the memory management and fault tolerance of RDDs. It's going to help us to schedule and distribute tasks and manage the jobs running within the cluster, and it's going to help us to store data in the storage system as well as read data from the storage system; for the file-system-level operations it's going to help us. And Spark core programming can be done in any of these languages: Java, Scala, Python, as well as R. So core is at the horizontal level, and on top of Spark core we have a number of components. There are different types of RDDs available, and one special type is the pair RDD. So next question: what do you understand by a pair RDD? The data is going to exist in pairs, as keys and values, so I can apply some special functions within the pair RDDs, special transformations, like collecting all the values corresponding to the same key, similar to what happens within the sort and shuffle of Hadoop: operations where you want to consolidate or group all the values corresponding to the same key, or apply some function against all the values corresponding to the same key.
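What such a combine-by-key operation does can be sketched in plain Python (illustrative only; in Spark this is what a pair-RDD transformation like `reduceByKey` performs across partitions):

```python
def reduce_by_key(pairs, fn):
    """Sketch of reduceByKey on a pair RDD: merge all values that
    share a key using the given function."""
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v
    return out

sales = [("apple", 2), ("pear", 1), ("apple", 3)]
totals = reduce_by_key(sales, lambda a, b: a + b)
```

In the distributed case the same merge runs per partition first and the partial results are then shuffled and merged by key, which is why the speaker compares it to Hadoop's sort and shuffle.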
Like, I want to get the sum of the values for each key: we can use the pair RDD and achieve that. So the data within the pair RDD is going to exist in pairs, keys and values. Okay, a question from Jason: what are vector RDDs? In machine learning you will have a huge amount of processing handled by vectors and matrices, and we do lots of vector operations, like adding two vectors or transforming any data into a vector form. A vector, in the normal sense, will have a direction and magnitude, so we can do operations like summing two vectors, or asking what the difference is between vector A and B as well as between A and C; if the difference between vectors A and B is less compared to A and C, we can say vectors A and B are somewhat similar in terms of features. So the vector RDD will be used to represent a vector directly, and that will be used extensively while doing machine learning, Jason.
Thank you. Here is another question: what is RDD lineage? So here, for any data processing, any transformations that we do, Spark maintains something called a lineage: how the data is getting transformed. When the data is available in partitioned form on multiple systems and we do transformations, it will undergo multiple steps, and in the distributed world it's very common to have failures of machines or machines going out of the network; the framework should be in a position to handle that, and Spark handles it through the RDD lineage: it can restore the lost partition only. Assume data is distributed across five machines out of ten, and of those five machines one machine is lost; the partition in the lost machine alone, with whatever the latest transformation on that data was, can be regenerated, and Spark knows how to regenerate that resultant data using the concept of RDD lineage: from which data source it got generated, what was its previous step.
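A toy sketch of that idea in plain Python (the class and names are hypothetical, not Spark internals): a partition that remembers its source and transformation chain can rebuild its data after a loss, without needing a replica:

```python
class Partition:
    """Sketch of lineage: remember the source and the chain of
    transformations so the data can be recomputed if it is lost."""
    def __init__(self, source, lineage=()):
        self.source = source
        self.lineage = list(lineage)
        self.data = None

    def map(self, fn):
        # Record the step; the lineage grows, nothing is computed.
        return Partition(self.source, self.lineage + [fn])

    def compute(self):
        out = self.source
        for fn in self.lineage:
            out = [fn(x) for x in out]
        self.data = out
        return out

p = Partition([1, 2, 3]).map(lambda x: x * 10)
p.compute()
p.data = None            # simulate the machine holding this partition being lost
recovered = p.compute()  # regenerated from lineage, not from a stored copy
```

Only the lost partition is recomputed; partitions on healthy machines are untouched, which is the saving the speaker highlights.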
So the complete lineage will be available, and it's maintained by the Spark framework internally; we call that the RDD lineage. What is a Spark driver? To put it simply, for those who are from a Hadoop or YARN background, we can compare this to the Application Master. Every application will have a Spark driver, and that will have a Spark context which is going to coordinate the complete execution of the job. It will connect to the Spark master, deliver the RDD graph, that is, the lineage, to the master, and coordinate the tasks that get executed in the distributed environment. It can do the parallel processing, do the transformations and actions against the RDD. So it's a single point of contact for that specific application. The Spark driver is short-lived, and the Spark context within the Spark driver is going to be the coordinator between the master and the tasks that are running.
The Spark driver can get started in any of the executors within Spark. Name the types of cluster managers in Spark. So whenever you have a group of machines, you need a manager to manage the resources, and there are different types of cluster managers. Already we have seen YARN, Yet Another Resource Negotiator, which manages the resources of Hadoop; on top of YARN we can make Spark work. Sometimes I may want to have Spark alone in my organization, not along with Hadoop or any other technology; then I can go with standalone mode, as Spark has a built-in cluster manager, so Spark alone can get executed on multiple systems. But generally, if we have a cluster, we will try to leverage various other computing frameworks as well, like graph processing with Giraph and so on; in that case we will go with YARN or some generalized resource manager like Mesos. YARN is very specific to Hadoop and comes along with Hadoop; Mesos is a cluster-level resource manager, and if I have multiple clusters
within the organization, then I can use Mesos; Mesos is also a resource manager, a separate top-level project within Apache. Next question: what do you understand by a worker node? In a cluster, a distributed environment, we will have a number of workers; we call each a worker node or a slave node, which does the actual processing: it's going to get the data, do the processing and get us the result, and the master node is going to assign what has to be done by which worker node and have it read the data available in that specific worker node. Generally, the tasks will be assigned to the worker node where the data is located; especially in Hadoop, it will always try to achieve data locality. What is considered is the resource availability, the availability of resources in terms of CPU and memory as well. Assume you have some data replicated on three machines, and all three machines are busy doing work, with no CPU or memory available to start another task; it will not wait
for those machines to complete their jobs, free the resource and do the processing; it will start the processing on some other machine which is near to the machines having the data and read the data over the network. So, to answer straight, worker nodes are nothing but the nodes which do the actual work and report to the master on resource utilization; the tasks running within the worker machines will be doing the actual work. What is a sparse vector? Just a few minutes back I was answering a question on what a vector is: a vector is nothing but representing the data in a multi-dimensional form. The vector can be multi-dimensional as well: if I am going to represent a point in space, I need three dimensions, the x, y and z, so the vector will have three dimensions.
If I need to represent a line in that space, then I need two points, to represent the starting point of the line and the end point of the line; then I need a vector which can hold two dimensions: the first dimension will have one point, and the second dimension will have another point, let us say point B. If I have to represent a plane, then I need another dimension to represent two lines, and each line will be represented by two points. In the same way, I can represent any data using a vector form. Assume you have a huge number of feedback entries or ratings of products across an organization; let's take a simple example, Amazon. Amazon has millions of products, and no user, not even a single user, would have used all the millions of products within Amazon. We would hardly have used, say, 0.1 percent, or even less than that; maybe a few hundred products we would have used and rated within Amazon over a complete lifetime.
If I have to represent all ratings of the products with a vector, the first position of the vector is going to refer to the product with ID 1, the second position is going to refer to the product with ID 2, and so on. So I will have a million values within that particular vector, but out of a million values I'll have actual values only for the few hundred products where I have provided ratings; those may vary from 1 to 5, and for all others it will be 0. Sparse means thinly distributed. So to represent this huge amount of data, rather than storing all the zeros, we store only the non-zero entries, each as a position and its corresponding value: what position has what value. That means all other positions are implicitly zero, so we can have this sparse vector representing only the non-zero entities.
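The sparse representation described here amounts to keeping a position-to-value map of the non-zero entries only; a minimal plain-Python sketch (the function name is made up):

```python
def to_sparse(ratings):
    """Keep only the non-zero entries, position -> value (sparse vector sketch)."""
    return {i: v for i, v in enumerate(ratings) if v != 0}

# A user rated only 2 of 7 products; storing zeros for the rest wastes space.
dense = [0, 0, 4, 0, 5, 0, 0]
sparse = to_sparse(dense)
```

In Spark's MLlib the equivalent construct is `Vectors.sparse(size, indices, values)`, which stores the vector length plus the non-zero positions and their values.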
So to store only the non-zero entities, the sparse vector will be used, so that we don't waste additional space; we save space using the sparse vector. Let's discuss some questions on Spark Streaming. How is streaming data handled in Spark? Explain with examples. Spark Streaming is used for processing real-time streaming data; to say it precisely, it's micro-batch processing. So data will be collected at every small interval, say every 0.5 seconds or every second, and it will get processed; internally it's going to create micro-batches, and the data created out of a micro-batch is called a DStream. A DStream is like an RDD, so I can do transformations and actions: whatever I do with an RDD, I can do with the DStream as well. And Spark Streaming can read data from Flume, HDFS or other streaming sources as well, and store the data into a dashboard or any other database, and it provides very high throughput, as the data can be processed on a number of different systems in a distributed fashion. Again, the DStream will be partitioned internally, and it has the built-in feature of fault tolerance: even if any data is lost, if a transformed RDD is lost, it can regenerate those RDDs from the existing or source data.
So the DStream is going to be the building block of Spark Streaming, and it has the fault tolerance mechanism we have within the RDD; the DStream is a specialized form of RDD, specifically for use within Spark Streaming. Okay, next question: what is the significance of the sliding window operation? That's a very interesting one. In streaming data, whenever we do the computing, the data density or the business implications of that specific data may oscillate a lot. For example, within Twitter we used to see the trending tweet hashtag; just because a hashtag appears very popular for a moment, maybe because someone hacked into the system and generated a number of tweets, for that particular hour it might have appeared millions of times; but just because it appeared millions of times for that specific minute, or a two- or three-minute duration, it should not become the
trending hashtag for that particular day or for that particular month. So what we will do is take an average: like a window, the current time frame and T minus 1, T minus 2, all that data we will consider, and we will find the average or the sum; the complete business logic will be applied against that particular window. So any drastic change, to say it precisely, any drastic spike or drastic dip in the pattern of the data, will be normalized. That's the significance of using the sliding window operation within Spark Streaming, and Spark can handle the sliding window automatically: it can store the prior data, the T minus 1, T minus 2, and how big the window needs to be can be defined easily within the program, at the abstract level. The next question is, what is a DStream? The expansion is discretized stream. So that's the abstract form of representation of the data for Spark Streaming, the same way RDDs get transformed from one form to another form. We will have a series of RDDs all put together, called a DStream; a DStream is nothing but another representation of RDDs, like a group of RDDs represented as a stream, and I can apply the streaming functions or any of the transformations and actions available within streaming against this DStream. So I will define at what interval the data should be collected and processed, because it is a micro-batch.
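The window-based smoothing described a moment ago can be sketched in plain Python (illustrative only, not Spark Streaming's actual windowing API): each point is averaged with its previous window-minus-one points, so a one-off spike is damped:

```python
def sliding_averages(counts, window):
    """Average each point with up to (window - 1) preceding points: T, T-1, T-2..."""
    out = []
    for t in range(len(counts)):
        lo = max(0, t - window + 1)
        out.append(sum(counts[lo:t + 1]) / (t - lo + 1))
    return out

per_minute = [10, 12, 11, 500, 13]        # one interval has an artificial spike
smoothed = sliding_averages(per_minute, 3)
```

The spike of 500 becomes roughly 174 after averaging over the window, so it no longer dominates the trend, which is the normalization effect the speaker describes.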
It could be every 1 second, or every hundred milliseconds, or every five seconds; I can define that particular period, and all the data received in that particular duration will be considered as one piece of data, and that will be called a DStream. Next question: explain caching in Spark Streaming. Of course, Spark internally uses in-memory computing, so any data that gets generated while doing the computing will be there in memory; but if you do more and more processing with other jobs, when there is a need for more memory, the least used RDDs will be cleared out of the memory, or the least used data resulting from actions on the RDDs will be cleared out of the memory. Sometimes I may need certain data always in memory; a very simple example is a dictionary: I want the dictionary words to be always available in memory, because I may do a spell check against the tweet comments or feedback comments at any time. So what I can do, I can say cache: any data that comes in, we can cache it, or persist it in memory.
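The caching idea can be sketched in plain Python (a conceptual sketch, not Spark's `cache()`/`persist()` machinery; the class name is made up): compute once, keep the result in memory, and serve every later use from the cached copy:

```python
class Cached:
    """Sketch of cache()/persist(): compute once, keep in memory, reuse."""
    def __init__(self, compute_fn):
        self._compute = compute_fn
        self._value = None
        self.computations = 0

    def get(self):
        if self._value is None:
            self.computations += 1          # the expensive recompute happens here
            self._value = self._compute()
        return self._value

# Dictionary words kept in memory for repeated spell-checks
dictionary = Cached(lambda: {"spark", "hadoop", "streaming"})
results = [w in dictionary.get() for w in ["spark", "sprak", "hadoop"]]
```

In Spark, `persist()` additionally lets you choose the storage level, which is the memory-only versus memory-and-disk choice mentioned next.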
So even when there is a need for memory by other applications, this specific data will not be removed, and it will be used to do the further processing. And for caching we can also define whether it should be in memory only, or in memory and disk; that also we can define. Let's discuss some questions on Spark GraphX. The next question is, is there an API for implementing graphs in Spark? In graph theory, everything will be represented as a graph; a graph will have nodes and edges, and all of that will be represented using RDDs. So it's going to extend the RDD; there is a component called GraphX, and it exposes the functionalities to represent a graph. We can have an edge RDD and a vertex RDD, and by creating the edges and vertices
I can create a graph, and this graph can exist in a distributed environment; in the same way, we will be in a position to do parallel processing as well. So GraphX is just a form of representing the data as graphs, with edges and vertices, and, yes, of course it provides the API to create the graph and do the processing on the graph. What is PageRank? In GraphX, once the graph is created, we can calculate the PageRank for a particular node. That's very similar to how we have the PageRank for websites within Google: the higher the PageRank, the more important it is within that particular graph. It's going to show the importance of that particular node or edge within that particular graph; a graph is a connected set of data.
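A minimal, self-contained sketch of iterative PageRank in plain Python (GraphX ships its own distributed implementation; this just shows the idea of rank flowing along edges until it settles):

```python
def pagerank(links, iters=20, d=0.85):
    """Toy PageRank: links maps each node to its outgoing edges."""
    n = len(links)
    rank = {node: 1.0 / n for node in links}
    for _ in range(iters):
        new = {node: (1 - d) / n for node in links}
        for node, outs in links.items():
            for target in outs:
                # Each node shares its rank equally among its out-links.
                new[target] += d * rank[node] / len(outs)
        rank = new
    return rank

# "c" is linked to by both "a" and "b", so it ends up most important.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

The fixed `iters` here corresponds to the static PageRank mentioned next; the dynamic variant instead iterates until the ranks stop changing beyond a tolerance.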
Vertices will be connected using properties, and for how much importance that property carries we will have a value associated with it. So within PageRank we can calculate a static PageRank, which will run a fixed number of iterations, or there is another form called dynamic PageRank, which will get executed till we reach a particular saturation level, and the saturation level can be defined with multiple criteria. The graph operations can be directly executed against those graphs, and they are all available as APIs within GraphX. What is a lineage graph? The RDD lineage is very similar to the GraphX graph representation: every RDD internally will have the relation saying how that particular RDD got created, from where, and how it got transformed. So the complete lineage, the complete history, the complete path, will be recorded within the lineage, and that will be used in case any particular partition of the RDD is lost: it can be regenerated.
Even if the complete RDD is lost, we can regenerate it. So it will have the complete information: what the partitions are, where they exist, what transformations it had undergone, what the resultant is; if anything is lost in the middle, it knows where to recalculate from and what are the essential things that need to be recalculated. That's going to save us a lot of time, and if that RDD is never being used, it will never get recalculated; the recalculation triggers based on the action, only on a need basis, which is why it's going to use the memory optimally. Does Apache Spark provide checkpointing? Take the example of streaming: if any data is lost within that particular sliding window, we cannot get back the data, or the data will be lost. Assume I'm making a window of, say, 24 hours to do some averaging; I'm making a sliding window of 24 hours, and it will keep on sliding. If you lose any system, as in there is a complete failure of the cluster,
I may lose the data, because it's all available in the memory. So how to recover if the system is lost? It follows something called checkpointing: we can checkpoint the data, and it's provided directly by the Spark API. We have to just provide the location where it should get checkpointed, and you can read that particular data back when you start the system again; whatever state it was in, we can regenerate that particular data. So, yes, to answer the question straight, Spark supports checkpointing, and it will help us regenerate the state it was in earlier. Let's move on to the next component, Spark MLlib. How is machine learning implemented in Spark? Machine learning, again, is a very huge ocean by itself, and it's not a technology specific to Spark; machine learning is a common data science area where we have different types of algorithms, different categories of algorithms, like clustering, regression and dimensionality reduction, and most of these algorithms have been implemented in Spark. Spark is the preferred framework or component to run machine learning algorithms nowadays, the reason being that most machine learning algorithms need to be executed iteratively a number
of times till we get the optimal result, maybe, say, twenty-five iterations or fifty iterations, or till we get a specific accuracy; you will keep on running the processing again and again, and Spark is a very good fit whenever you want to do the processing again and again, because the data will be available in memory: I can read it faster, store the data back into memory, and again read it faster. All these machine learning algorithms have been provided within Spark in a separate component called MLlib, and within MLlib we have other components, like featurization to extract the features. You may be wondering how we can process images; the core thing about processing an image or audio or video is extracting the features and comparing how much the features are related. That's where vectors, matrices, all of that will come into the picture, and we can have a pipeline of processing as well: do processing one, then take the result and do processing two. And it has persistence as well: the model generated by the processing can be persisted and reloaded back into the system to continue the processing from that particular point onwards. Next question:
what are the categories of machine learning? Machine learning has different categories available: supervised, unsupervised and reinforcement learning. Supervised and unsupervised are very popular. In supervised learning I'll know some information well in advance; I'll give an example: say I want to do character recognition; while training with the data, I can give information saying this particular image belongs to this particular character or this particular number, and I can train. Sometimes I will not know this well in advance; assume I have different types of images, like cars, bikes, cats and dogs, and I want to know how many categories there are. I will not know that well in advance, so I want to group them, find how many categories are available, and then I'll realize, okay, all these belong to a particular category; I'll identify the pattern within the category and give the category a name, like, all these images belong to the boat category, they look like a boat.
So leaving it to the system, without providing the category values up front: that's where a different type of machine learning, unsupervised learning, comes into the picture. And as such, machine learning is not specific to Spark; Spark is just going to help us run these machine learning algorithms. What are the Spark MLlib tools? MLlib is nothing but the machine learning library or machine learning offering within Spark; it has a number of algorithms implemented, and it provides a very good feature to persist results. Generally in machine learning we will generate a model; the pattern of the data is what we call a model, and the model can be persisted in different forms, like Parquet and other formats; it can be stored and persisted. And it has methodologies to extract features from a set of data: I may have a million images, and I want to extract the common features available within those millions of images. There are other utilities available as well, to define the seed, to do the randomizing; different utilities are available, as well as pipelines.
Pipelines are very specific to Spark, where I can arrange the sequence of steps to be undergone by the machine learning: run machine learning algorithm one first, and then its result will be fed into machine learning algorithm two; like that, we can have a sequence of execution, and that will be defined using the pipelines. These are notable features of Spark MLlib. What are some popular algorithms and utilities in Spark MLlib? So these are some popular algorithms: regression, classification, basic statistics, recommendation systems. The recommendation system is well implemented; all we have to provide is the data. If you give the ratings and products within an organization, if you have the complete dump, we can build the recommendation system in no time, and for any given user you can give a recommendation: these are the products the user may like, and those products can be displayed in the search result. The recommendation system really works on the basis of the feedback that we have provided for the products we bought earlier.
Next, dimensionality reduction. Whenever we do training with a huge amount of data it is very compute-intensive, and we may have to reduce the dimensions, especially the matrix dimensions, without losing the features. We should reduce the dimensionality while keeping whatever features are available, and there are algorithms in MLlib to do that dimensionality reduction, as well as feature extraction. Feature extraction asks: what are the common features available within a particular image, so that I can compare the common features across a set of images? That is how we would group those images, or answer whether a person looking like a given image is available in the database or not. For example, assume a police department's crime branch maintains a list of persons who have committed crimes. When they get a new photo and do a search, they may not have the exact photo bit by bit; the photo might have been taken with a different background, different lighting, a different location, a different time. So the data will be different, the bits and bytes will differ entirely, but the faces look alike.
Yes, the faces are going to look similar, so I search for photos that look like the input photograph I provide. To achieve that we extract the features of each photo and try to match the features rather than the bits and bytes. There is optimization support as well, in terms of processing or building the pipeline; a number of algorithms are available to do optimization. Let's move on to Spark SQL. Is there a module to implement SQL in Spark, and how does it work? Spark SQL is broadly similar to Hive: whatever structured data we have, we can read it and extract meaning from it using SQL. It exposes APIs, and we can use those APIs to read the data and create DataFrames. Spark SQL has four major components: the data source API, the DataFrame API, the interpreter and optimizer, and the SQL service. A DataFrame is a representation of rows and columns, like spreadsheet data, multi-dimensional structured data in an abstract form. On top of a DataFrame I can run queries; internally, any query I fire gets interpreted and optimized, then executed through the SQL service, fetching the data from the DataFrame, or reading it from the data source and doing the processing.
What is a Parquet file? It is a file format where data is kept in structured form; in particular, the results of Spark SQL can be stored, or persisted, as Parquet and read back. Parquet is open source from Apache. It is a data serialization technique, and to say it precisely, it is a columnar storage format: it consumes less space, it organizes the data column by column, and it helps you access specific columns from the stored file using a query. So Parquet is an open-source, columnar data serialization format used to store and persist data as well as to retrieve it. Next: list the functions of Spark SQL. It can be used to load varieties of structured data, and yes, Spark SQL works only with structured data.
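To see why a columnar layout saves work, here is a plain-Python toy (not Parquet itself, just the idea it is built on): the same table stored row-wise versus column-wise, where reading one column from the columnar form touches exactly one list instead of every record.

```python
# the same three-row table in two layouts
row_store = [
    {"name": "alice", "age": 34, "city": "Oslo"},
    {"name": "bob", "age": 45, "city": "Lima"},
    {"name": "carol", "age": 29, "city": "Pune"},
]

# columnar: one contiguous list per column, the way Parquet lays data out on disk
column_store = {
    "name": ["alice", "bob", "carol"],
    "age": [34, 45, 29],
    "city": ["Oslo", "Lima", "Pune"],
}

# "SELECT age": the row store must walk every record,
# while the column store reads exactly one list
ages_from_rows = [rec["age"] for rec in row_store]
ages_from_columns = column_store["age"]
```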
It can be used to load varieties of structured data, you can use SQL-like queries against the program, and it can be used with external tools that connect to Spark as well. It gives very good integration with SQL, and from Python, Java, or Scala code we can create an RDD from structured data directly using Spark SQL. So it helps people from a database background write their programs faster and quicker. Next question: what do you understand by lazy evaluation? Whenever you do any operation in the Spark world, Spark will not do the processing immediately; it looks at the final result we are asking for. If we don't ask for a final result, it doesn't need to do the processing. So everything is based on the final action: until we invoke an action, the transformations are only recorded, not executed.
There will not be any actual processing happening; Spark just records which transformations it has to do. Finally, when we ask for the action, it completes the data processing in an optimized way and gives us the final result. To answer it straight: lazy evaluation means doing the processing only when the resulting data is needed; if the data is not required, no processing happens. Can you use Spark to access and analyze data stored in a Cassandra database? Yes, it is possible, and not only Cassandra, Spark can work with any NoSQL database. Cassandra also works in a distributed architecture; it is a NoSQL database, so Spark can leverage data locality. Queries can be executed locally where the Cassandra nodes are, which makes query execution faster and reduces network load: Spark will try to start its executors on the machines where the Cassandra nodes, and hence the data, are available, and do the processing locally.
Next question: how can you minimize data transfers when working with Spark? At the core of the design, the success of a Spark program depends on how much you reduce network transfer; network transfer is a very costly operation. There are multiple ways, especially two, to avoid it: broadcast variables and accumulators. A broadcast variable helps us transfer static data, or any information to be published, to multiple systems: if some data needs to be available on multiple executors for common use, I can broadcast it. And if I want to consolidate values produced on multiple workers into a single centralized location, I can use an accumulator. Together these help us achieve data distribution and data consolidation in the distributed world.
The APIs are at an abstract level, so we don't need to do the heavy lifting; that is taken care of by Spark for us. What are broadcast variables? As we just discussed, a common value that we need may have to be available on multiple executors and workers. A simple example: you want to run a spell check on tweet comments against a dictionary that has the correct list of words. I want that dictionary to be available in each executor so that the tasks running locally in those executors can refer to it and get the processing done while avoiding repeated network data transfer. So the process of distributing data from the Spark context to the executors where the tasks will run is achieved using broadcast variables; they are built into the Spark API, and using that API we can create the broadcast variable while the framework takes care of making the data available on all executors. Explain accumulators in Spark.
In a similar way to broadcast variables, we have accumulators. A simple example: you want to count how many error records exist in a distributed environment. As your data is distributed across multiple systems and executors, each executor will do its processing and count its own records, but I may want the total count. So I ask Spark to maintain an accumulator; it is maintained in the SparkContext, in the driver program, and the driver program is one per application. It keeps getting accumulated, and whenever I want I can read the value and take appropriate action. So accumulators and broadcast variables look like opposites of each other in direction, but their purposes are totally different. Why is there a need for broadcast variables when working with Apache Spark? A broadcast variable is read-only and is cached in memory in a distributed fashion, and it eliminates the work of repeatedly moving the data from a centralized location, that is the Spark driver, to all the executors within the cluster where the tasks will get executed.
We don't need to worry about where the task gets executed within the cluster. Compared with accumulators, broadcast variables are read-only: the executors cannot change the value, they can only read it, not update it, so mostly they will be used like a cache. Next question: how can you trigger automatic clean-ups in Spark to handle accumulated metadata? There is a parameter we can set, spark.cleaner.ttl. The cleaner gets triggered alongside the running jobs, and in between it writes intermediate results to disk where needed and cleans up unnecessary data and RDDs that are no longer used. The least-recently-used RDDs get cleaned, keeping both the metadata and the memory tidy.
Next: what are the various levels of persistence in Apache Spark? When we say data should be stored, it can be persisted in different ways: in memory only, in memory and disk, or on disk only, and while it is being stored we can ask for it to be kept in serialized form. The reason we may persist an RDD is that we want that particular form of the data back for later use. Maybe I won't need it immediately, so I don't want it to keep occupying my memory; I write it to disk and read it back whenever there is a need. The next question: what do you understand by a schema RDD? A schema RDD is used within Spark SQL.
So the RDD has the meta information built into it: it carries a schema, very similar to a database schema, the structure of that particular data. When I have the structure, it is easy for me to handle the data, so the data and the structure exist together. The schema RDD is now called a DataFrame in Spark, and the DataFrame term is popular in other languages too, such as R. So it holds the data plus the meta information about that data: which columns it has and how it is structured. Explain a scenario where you would use Spark Streaming. Say you want to do a sentiment analysis of tweets: the tweets will be streamed, so we use a tool such as Flume to harvest the information from Twitter and feed it into Spark Streaming. It extracts or identifies the sentiment of each and every tweet and marks whether it is positive or negative, maybe the percentage of positive and negative sentiment, and stores the resulting data in some structured form.
Then you can leverage Spark SQL to do grouping or filtering based on the sentiment, and maybe use a machine learning algorithm to ask what drives a particular tweet to be on the negative side: is there any similarity among all the negative tweets? They may be specific to a product, to a specific time when the tweets were posted, or to a specific region they came from. Those analyses can be done by leveraging the MLlib of Spark. So MLlib, Spark Streaming, and Spark Core all work together; they are different offerings available to solve different problems. With this we are coming to the end of this discussion of Spark interview questions. I hope you all enjoyed it and that it was constructive and useful. More information about Edureka is available on the website; all the best, and keep visiting the website for blogs and the latest updates. Thank you, folks. I hope you have enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist and subscribe to the Edureka channel to learn more.