Video: R Programming for Beginners | R Programming for Data Science | Intellipaat
Hey guys, welcome to this session by Intellipaat. So, 'R' is a programming language developed by statisticians for statisticians. So, if you are interested in any sort of statistical analysis, then R should be your go-to language, and also 'R' is a versatile language. It is a great visualization tool, and it provides multiple libraries for Machine Learning algorithms. So, in today's session, we are going to learn about 'R' programming.
Now before going ahead, do subscribe to Intellipaat's YouTube channel so that you never miss out on any of our upcoming videos, and also if you are interested in doing an end-to-end certification course on Data Science with R, then Intellipaat provides just the right course for you. Now, let's go through the agenda for today's session. We'll start off with a quick introduction to R programming and then we will learn about R-Studio GUI. We will also learn about different data structures in R; data structures like vector, list, matrix, and data frames. Going ahead, we will work with some inbuilt functions in R and also you can put down all your queries in the comment box; we would love to help you. So now, without any further delay, let's get started.
So, 'R' is an open-source language. It is available free of cost for anyone to learn and apply. 'R' is a language which is a little different from the traditional software-building languages like C, C++, or Java. It was meant specifically, or rather customized, for statisticians who would not like to do much coding but would rather spend more time on understanding data patterns. So, most of the functions are pretty straightforward, easy to learn, and easy to remember as well. R is a language for data analysis and statistical analysis. R is a visualization tool. Before R, there was a language called S, and R is derived from that language. If you read the history you'll get to know more, but that's not very important here. In R you can do data visualization. It's open-source, cross-platform software. It is a Turing-complete language. I'm not sure of what this is.
I'll go back and check what this means. So, installing R: if you want to install R, you can go to this link and copy it. All you have to do is open this link and click on download R for whichever operating system you have, like Linux, Mac, or Windows. Abdul is asking: Can you give an example of model building? That is what we are going to learn in this session. So when I say model building, a quick example is: how do you predict data? We're talking a lot about sales. What is sales dependent on? Sales is dependent on the stock, the price, the quantity, and the quality. Let's take only a few of these variables. So let us say we have sales. I'll give you a feel for it before we jump onto R. So, all of you can look at the link and search for the download of R for Windows, Mac, and Linux. This is where you have to click to download the software. So, sales is, say, dependent on your stock, and when I say stock, it's the stock of goods and not the stock price, then the price of every product, and maybe whatever discount you offer. Let me call these x1, x2, and x3, respectively. Sales is Y. Y is what you're going to predict.
So, what is model building here? If I want to predict sales, I would multiply each variable by some number. Say, 2 times X1 plus 2 times X2 plus 3 times X3 is my sales. Here you've assumed a stock of 10 units, a price of $5 per unit, and a discount of 10%. Your total sales will be 20 plus 10 plus 30, that is 60. So, given these values of the X variables, the sales will be 60, maybe in dollars or rupees or whatever. What do we need to find here? When we are building a model and we want to predict something, what is missing here? We want to find these coefficients. We know X1, we know X2, we know X3, and we want to predict Y. So, the idea of model building is solving this equation. This is just one example. We solve this equation to find these parameters, and these parameters are represented by beta 1, beta 2, and beta 3. So, if you're able to solve these equations and find these values of beta, the predictions will be straightforward.
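To make the sales example concrete, here is a minimal sketch in R. The coefficients 2, 2, and 3 and the inputs 10, 5, and 10 come from the example above. The 20 rows of randomly generated data, the variable names, and the use of lm() as the fitting method are assumptions added only for illustration; the session has not yet said which algorithm it will use to find the betas.

```r
# Prediction with known coefficients: sales = 2*x1 + 2*x2 + 3*x3
beta <- c(stock = 2, price = 2, discount = 3)
x    <- c(stock = 10, price = 5, discount = 10)
predicted_sales <- sum(beta * x)
print(predicted_sales)   # 20 + 10 + 30 = 60

# Model building goes the other way: given x and y, estimate the betas.
# Synthetic, noise-free data (an assumption made only for this sketch).
set.seed(1)
x1 <- runif(20, min = 1, max = 20)   # stock
x2 <- runif(20, min = 1, max = 10)   # price
x3 <- runif(20, min = 0, max = 30)   # discount
y  <- 2 * x1 + 2 * x2 + 3 * x3       # sales

# Fit without an intercept so the model matches the equation above.
fit <- lm(y ~ 0 + x1 + x2 + x3)
estimated_beta <- coef(fit)
print(round(estimated_beta, 3))
```

Because the synthetic data has no noise, the fitted coefficients should come back as the same 2, 2, and 3 that generated it; with real sales data they would only be estimates.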
It's just like replacing the variables. This is the exercise of model building: what are the methods, what are the algorithms, and what are the statistical processes you can apply in order to find the values of beta 1, beta 2, and beta 3. I will take up the next question. Is model building dependent on categorical data or business requirement? Of course, both. It can be any kind of data, categorical or anything else, driven by the business requirement. Whatever the business wants, that is the end goal. You ask what the vision is, and based on that you look at the categorical data, the numerical data, whatever data there is, and then try to build the model. And building a model, I think, I have explained just now. Now let's continue with the session. Nishan is asking why only 30 percent for testing data, in the train and test split which I talked about.
That is something you can experiment with. The standard is 30 percent, but people also do 60:40 or 80:20 depending on how much data you have. If you have sufficient data, even 20 percent will do, because you want to keep aside some data for testing, and that should be a good amount of data so that the test sample is sufficient to confidently validate the model. So moving on, once you are able to install R, the next thing we have to install is RStudio. So R is the programming language, and once you have installed R, you will get an interface where you can work, but it's not really user-friendly, and that's why we recommend installing RStudio, which is the commonly used, industry-wide IDE, or front end, for working with R. So, the prerequisite for RStudio is that you have R installed on your system. So do all of you have the RStudio link? You have it in the links, and this is what you have to click on: RStudio Desktop. So once you install R and RStudio, you'll have a screen something like this. Typically an RStudio screen will have four windows. So, let us understand these four windows. This layout is pretty common and famous in the analytics space, so even if I'm working in Python, there are IDEs which support an RStudio kind of layout for Python, because this is what everybody is used to in analytics.
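Going back to the train and test split question answered above, a 70:30 split can be sketched in a few lines of base R. The data frame here is made up purely for illustration, and sample() is only one common way to do the split; the session does not say which method it will use.

```r
set.seed(42)  # so the random split is reproducible

# Hypothetical data set with 100 rows (made up for this sketch).
n  <- 100
df <- data.frame(id = seq_len(n), value = rnorm(n))

# Pick 70% of the row numbers at random for training.
train_rows <- sample(seq_len(n), size = floor(0.7 * n))
train <- df[train_rows, ]    # rows used to build the model
test  <- df[-train_rows, ]   # remaining rows kept aside for validation

print(nrow(train))
print(nrow(test))
```

Changing 0.7 to 0.6 or 0.8 gives the 60:40 or 80:20 splits mentioned above.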
So the first one, on the top left, is where you write your code. These are the R files which you are going to go through when you start learning the syntax. So this is your space where you write your code, top left. So, let's say you are executing a line of code: y <- seq(1, 5, by = 0.5). Don't worry about what this means. This is simply creating a sequence of numbers with a step of 0.5. So, I execute this. As soon as I execute this, the same code is replicated here. How do I execute? Either I select this whole line and click on Run, or I click anywhere on this line, bringing the cursor anywhere on this line, and use Ctrl+Enter as a shortcut. As soon as you run this, you see a replica here. This is your console. So you write the code here, and it runs in the console. Now there are times when you don't want to retain your code. You just want to use some code for testing.
So you want to look at what is there in y. So y has 1, 1.5, 2, 2.5, etc. This is a sequence. So, if you just want to test it and not retain it, then you can use the console. Of course, both of these are running in the same R session. If you want to retain the code, write it here and do a file save; you can do a file save here. You have a Save As option. If you don't want to retain the code, you can just type in the console and look at the results. The top right one is where you have the environment. So you can look at which variables you have created, which tables you have created, and what the contents of these tables are as well. For example, this is one table, and it has 40 observations and 10 variables.
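The sequence example and the Environment pane can be tried directly in the console. In this small sketch, the call to seq() is the one from the session; the 40-by-10 table is a hypothetical stand-in filled with zeros, since the actual table on the instructor's screen is not shown.

```r
# Create a sequence from 1 to 5 in steps of 0.5 and store it in y.
y <- seq(1, 5, by = 0.5)
print(y)          # 1.0 1.5 2.0 ... 5.0
print(length(y))  # 9 values

# A hypothetical table with 40 observations and 10 variables,
# similar to the one shown in the Environment pane.
tbl <- as.data.frame(matrix(0, nrow = 40, ncol = 10))

# Console equivalents of what the Environment pane shows.
print(ls())       # names of the objects created so far
print(dim(tbl))   # 40 10
str(y)            # compact summary of y
```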
Click on it. You can click on the content here, but this is not the best way of looking at it; if it is a data frame, you can also find it somewhere here. I'll talk about that when we run the datasets. So for now, you can look at the contents at a high level. This will give you a summary of the data. All the variables which you create will be available in this window. In the last one, bottom right, you can look at the plots. So, any plot which is generated will be available here. I can export the plot as well, using Save as Image or Save as PDF. There are a lot of other options. As and when the requirement comes, I'll share those. So you should understand that we have four windows; that is good for now. So RStudio is a set of integrated tools designed to help you be more productive in R. It includes a console, a syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging, and managing your workspace.
So, the first and foremost thing which you should know here is your working directory. As soon as you open RStudio, you should find out what your working directory is. So how do I find my working directory? I'll just use getwd. This is a function in R which helps you locate your working directory. So, getwd(). This is an inbuilt function, not a custom function. It is available in R. This is what your working directory is. Now, by default your files are imported from this path, but if you want to import your files from some other path, you can set your working directory using setwd. You can just change your path. Say, for example, I want D drive, Intellipaat. So I'll just type in D drive, Intellipaat, and press Enter. Now, to check that, I can use getwd and see that my working directory has changed.
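The getwd and setwd steps above can be sketched as follows. Here tempdir() stands in for the new folder only because it is guaranteed to exist on any machine; on the instructor's laptop the argument would be a path on the D drive, something like "D:/Intellipaat" (the exact path is an assumption).

```r
# Remember the current working directory so it can be restored later.
old_wd <- getwd()
print(old_wd)

# Switch to another folder. tempdir() is used here only because it
# always exists; replace it with your own path, e.g. "D:/Intellipaat".
setwd(tempdir())
print(getwd())   # now points at the new folder

# Roll back to the previous working directory.
setwd(old_wd)
print(getwd())
```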
Now I'll get back to my previous one. I will roll it back for now. We will look at it again when we start importing R files. So I have changed it back to my previous working directory. So, setwd and getwd. RStudio options are accessible from the Options dialog, under Tools > Global Options. So we have General R Options. These are not mandatory, but they are good to know. You should know what these are and what the capabilities of RStudio are. So, General R Options: default CRAN mirror, initial working directory, workspace and history behavior. Then Source Code Editing, Appearance and Themes, Pane Layout. These are more to do with appearance and default settings. All you need to do is open R and start working. No customization is required. If you want a black screen with white text, you can do that under Tools > Global Options, but those are not mandatory. No, we are not going to use R GUI. A GUI is something R does not offer.
There are different startups which build a GUI on top of R, and that is not free of cost. I'm not sure what exactly R GUI is, but this is the industry-wide accepted format, and RStudio is widely used everywhere. RStudio also makes it very easy to install packages. So what are packages in R? As soon as you start working, you will need different packages. dplyr is one package which you are going to learn in detail; it helps with data manipulation and data wrangling, with a very easy to understand syntax. For example, if you want to do filtering of data, then summarizing, then adding columns, doing a sum and a group by, you can use dplyr. Now, not all packages are installed by default in R. What you have to do is go to Packages. Under Tools, go to Install Packages. It will connect to the CRAN repository. CRAN is a global repository for R which is maintained by a certain group of individuals. Although it's open source, there has to be some regulation.
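As a rough sketch of what this looks like in code: install.packages() is the console equivalent of Tools > Install Packages, and the dplyr verbs filter(), group_by(), and summarise() cover the filtering, grouping, and summing mentioned above. The small sales table is made up for illustration, and the install line is commented out so that running the sketch does not download anything.

```r
# Install once from CRAN (same effect as Tools > Install Packages).
# Commented out so this sketch does not download anything when run.
# install.packages("dplyr")

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  # Hypothetical sales table, made up for this sketch.
  sales <- data.frame(
    region = c("North", "South", "North", "South"),
    units  = c(10, 5, 8, 12)
  )

  # Keep rows with more than 5 units, then total the units per region.
  summary_by_region <- sales %>%
    filter(units > 5) %>%
    group_by(region) %>%
    summarise(total_units = sum(units))

  print(summary_by_region)
} else {
  message("dplyr is not installed; run install.packages(\"dplyr\") first.")
}
```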
So that is the website which I shared for the R installation. That is the CRAN website, and it maintains all the packages contributed by different programmers and associations. So, if you want to install the dplyr package, type dplyr and click Install. I am not doing it now, but as soon as I do it, it will start running some command in the console and install it for you. It will not take more than a minute for dplyr, but another package called ggplot2, which is used for data visualization, will take some time before it installs all its themes and other information. When you do that, you'll observe that this is what runs in the console: install.packages, and in brackets we have the package name, something like this.
It would have been good if these were packaged with the default installation, but packages like dplyr and ggplot2 are not default packages. I would recommend you know how to install packages, because we will need to install a lot of new packages. For example, for people who prefer SQL, we have the sqldf package, which can be used to write SQL-like syntax. So you may not be an expert in R syntax, but you can use SQL syntax to do the data wrangling in R. That requires a custom installation of the sqldf package, which I'll show when we get to classification. The next one is for people who are comfortable with and used to pivot tables in Excel: the same drag-and-drop and pivoting operations can be done in R using the rpivotTable package. So that's why it is important to learn how to install a package. Okay, the other things are not very important: Git/SVN, publishing, and
so on; please ignore those for now. Git and SVN are rarely used here. If you know them, these are like central repositories where you can maintain the versioning of all of your code, but you will barely use them. Of course, there are ways to install packages from outside CRAN as well, which is not recommended. It is just like the Play Store or the Apple App Store: especially on Android devices, you can install apps from outside the store, but that is not recommended, because they may not be secure and they may not be stable. So it's good to install from CRAN. And if required, you can also contribute; this is an open-source setup, so anyone can contribute free packages. Let us say there is a complex operation for which you need to write a lot of lines of code; you can package it and release it on CRAN. There's a separate process for that which you can read about, but if you are interested you can do that in both R and Python.
So we have seen the RStudio GUI: the script window, the console window, the environment window, and the plots. We have looked at R packages. Formally, packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages; others are available for download and installation.
So Abdul is asking: Can we use R for the data science lifecycle? Definitely, that is why we are using it. For the complete lifecycle, yes, you can, because R will help you import data from different sources, you can do all kinds of data manipulation, all kinds of algorithms and model building, and then you can also do data visualization. So yes, you can, but companies tend to use a combination of different tools. Sometimes they spend a lot of money on dedicated data visualization tools as well. Startups, though, sometimes rely only on R and
Python, either of these, for the complete end-to-end setup. But then the problem is that when the data is at scale, you will need some distributed system, because R is an in-memory kind of setup. So what happens is, let's say you have four million customers. As soon as you do an import, and read.csv is what does the import, all the 4 million records get loaded into memory, that is, into your RAM. So that's why you would rely on some other scalable system, like a big data platform. R alone is not sufficient; something on top of it, like Hive or the cloud, is useful.
So I think we will cover the important ones. install.packages is used for installing a package; you can also use the GUI in RStudio. If you want help for a package, you can use library(help = package name), or alternatively you can go to the Help window, in the bottom right window, click on Help, and find any topic, say ggplot2. Do we have the help here? Oh, this is the search window, sorry. Let me type it in and check if we're able to search. Yes. So for ggplot2 you get the full details about who is maintaining it and who the contributors are. You also get the GitHub link, so you can look at the source code. That's the beauty of it. If you want to add more to it, say one fancy chart you want is not available in ggplot2, like a donut chart or whatever, or some animations, you can always contribute. You can report bugs, and you can read more about it.
Devesh is asking: What is the data limit R can handle or manage easily? I would say not more than 300 MB, because after loading 300 MB of data you do a lot of data manipulation, and when you do that the data gets replicated in your RAM. So more than 300 MB is not recommended, and structured data will usually not be more than that. It also depends on the RAM. If you have a laptop or machine with 64 GB of RAM, and people are buying those GPU-based machines
these days, then you can as well run something like 4 GB of data in your RAM itself, on your local system.
If the data is in huge volume, which tool is preferred instead of R? It's not about which tool is preferred over R. You do all the things in R, and you leverage distributed systems. It is an architectural layering: R sits on top of Spark or Hive, or on cloud services like Amazon Web Services, Azure, or Google Cloud Platform. It has to sit on top of these in order to handle so much data. Just as I have installed R on my laptop, you install R on these big systems.
R is not integrated with any DB; that is what Devesh is asking: Is R integrated with any DB? It's not integrated, but you can always have ODBC connections. There is no integration. R has its own proprietary format to store data in memory, in the RAM, and that's all it does.
Is it better to have the working directory in a cloud folder, like a OneDrive folder? So Rohit is asking about a cloud folder. Yes, you can, if you have enough space in your OneDrive folder or any other cloud. I'm not exactly sure, but you should be able to do this in OneDrive, because OneDrive especially, I know, maps to your local system, so you can simply use the path if you have enough space.
Yes, Devesh, R stores data in its own proprietary format. And what is ODBC? It's a kind of connector which connects to different databases. ODBC is a protocol, or an API, which helps in connecting to different databases. If R has to connect with Oracle or Teradata, the back-end systems, because the data cannot always be in flat files, then ODBC is used. All these setups will be done by the admin guys; you don't need to worry about the setup. As data science enthusiasts and data scientists, you should concentrate on what you do with the data once you get it and how you get insights out of it. So all we do is import data from CSV. This is the most commonly used data format in the industry, and most data scientists will get this
CSV data for data exploration and model building, and once the model building is done, engineers will help you scale your model in the systems, like connecting it to ODBC, JDBC, or whatever databases there are. All these things will be done for you; don't worry about that. Yes, I think Sai has given the full form of ODBC.
How do we decide if we have to go for R or Python? I think both are pretty much capable of doing everything. Python is a general-purpose programming language. I don't want to give you a biased review, but Python has wider coverage, because you can do everything. If you want to design an API, run some networking protocols, or use it for data science, you can do everything in Python. You can do scripting as well. But the issue is that although Python is pretty famous, you will realize when you talk to full-stack software engineers that it is not very fast, and the learning curve is also slow. It is built on top of C and C++. Contrary to that, R is very easy to use and the syntax is pretty simple. A simple copy of one table from another table is pretty complicated in Python; you cannot just say table 1 is equal to table 2.
Now let's continue with the session. R is also built on C and C++. What I'm trying to highlight here is that Python is marketed as a general-purpose and very robust language. I do not completely agree with that, although I use Python a lot. All I'm trying to say is that R is equally good, and something in Python makes things very complicated when you want to do trivial stuff like copying a table. There is an indexing concept which was meant to make it faster, but it makes things very complicated. So for any data science
work, I think R is equally good. I would say R has to mature more on the deep learning side, the neural networks and the AI stuff; the open-source contributors there are not very active. Although, from what I've seen, almost every algorithm which is available in R is in Python and vice versa, to date. And there are a few statistical functions or statistical tests which are not really covered in Python. For example, I was looking for the Mann-Whitney test and some other hypothesis tests which I did not find in Python. So it's difficult to decide between R and Python. I don't have a conclusion. Both are equally good; it depends on the requirement.
One complaint I have, I would not say complaint, but one observation which data engineers are sharing with us is this. When we develop some algorithm and share it with engineers to scale it, and of course we also sit with them and do that, one observation is that it's difficult to debug the R code when it is integrated with cloud systems. The logs are not very elaborate, so it is hard to understand where the system is failing. With Python, because Spark and all the other systems have good compatibility with it, it's easier to debug. So that's one difference which I have heard lately. Okay, I think we can take up these kinds of questions. I am fine with it; we will slowly increase the pace.
The next question is about how data is handled if it is in a language other than English. Okay, this is an interesting question, thanks for this. So let's talk about machine translation. Let's say you want to do some recommendations and you have all the data in the form of text, some of it in Chinese and so on. Say there is a product whose name is written in Chinese and has some characters; say the product is bread, written in Chinese. So number one, there are different language APIs which understand these characters, and they give each one a code, ASCII or whatever it is,
and you can uniquely identify a character. If not, there are many interesting algorithms, for example word2vec. What this essentially does is take all your data, irrespective of whatever language it is in, and based on the context window in which each word appears, the co-occurrences of different words, it can identify the other words which are similar to bread, maybe synonymous with bread, just based on the co-occurrences. So that is the beauty of data science: irrespective of the language in which the text is written, you can leverage the maths behind it. One of the features is co-occurrence, used to understand the synonyms. And there are different APIs which do this. As for the encoding format, we have UTF-8. UTF-8, I think, is compatible with all the languages, or most of the common languages. I hope that answers the question.
Okay, so that's all about the main thing, the basics of R. We will learn the functions in a bit. Installation-wise, I hope you guys are able to install. So we can move on to variables. When we start learning R, we should understand what variables are in R. A variable is a temporary storage space where you can keep changing values. Like any other programming language, we have this concept of variables, and it is as easy as just typing in, say, x is equal to 2, and then you type in x and you get 2. Or, if you want to do some operation, x is equal to 2 plus 3, and then say x. So this is fine. Do you observe something? This is the first time I'm using x, and I never declared whether it's an integer or a float or a character or whatever. So it's dynamic: based on the data you store in it, the type is decided. Of course, every programming language has to have data types, otherwise it cannot work as expected. So here it is dynamic, unlike, say, C and C++, where you have to say int x first and then say x is equal to something; you declare the variable with its data type. So it is user-friendly. Of course, there are a lot of operations
which happen in the back end. As soon as you assign a value to x, the back end reads the value, works out its type, and assigns that as a property of this x variable. So yes, it adds an overhead in the back end, which has to do a lot of operations, but for the user it is user-friendly.
So, data types in R. The data types are numeric, character, logical, and complex. Numeric all of us know: any number, be it positive, negative, or even decimal, is of the numeric data type. So you can store x is equal to 5, which is a whole number, or x is equal to 5.5, which is a decimal; both are numeric. Then character: you can also store x is equal to hello world. Let's try it out. So here it is simple; I don't have to say I want to convert x into a character. I can directly write x is equal to "hello world", and now this becomes a character. Let's say I want to check the type. So how do we do that? I'm not sure about the command. Okay, this is class. I also get confused between R and Python, and that happens when you use these two languages interchangeably. So use class: class of x gives you character. And if you now say x is equal to 1 again and then say class of x, it becomes numeric.
Okay, so numeric, character, and then we have logical. Logical is true or false, so it is more like 1 and 0. In an if-else condition, when you do any test, you get true and false as the two outputs, so based on this you can design your code and develop some logic. We'll look at it when we run some functions, like checking if a string is contained in another string or if a character is present in a string. And the last one is complex. It is rarely used, but all of you know what a complex number is: it has a real and an imaginary component, like 30 minus 2i. So 30 is the real component and minus 2i is the imaginary component. You can think of it like the x and y axes, the two coordinates, 30 comma minus 2. I'm not sure how to use it; let's see, because I have never used complex numbers. Let us say x is equal to 30 minus 2i; let me see.
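The four data types described above can be checked with class(). This is a minimal sketch; the variable names that hold the results are made up for illustration, while the values 5, 5.5, "hello world", and 30 - 2i are the ones used in the session.

```r
x <- 5
class_whole <- class(x)      # "numeric" (R stores 5 as a double unless you write 5L)

x <- 5.5
class_decimal <- class(x)    # "numeric"

x <- "hello world"           # no declaration needed; x simply changes type
class_text <- class(x)       # "character"

x <- TRUE
class_flag <- class(x)       # "logical"

x <- 30 - 2i
class_complex <- class(x)    # "complex"
real_part <- Re(x)           # 30
imag_part <- Im(x)           # -2

print(c(class_whole, class_decimal, class_text, class_flag, class_complex))
```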
Let me work this out; if you can think of a use case for it, we can try that out. Yes, so now x is equal to 30 minus 2i, and if I say, I'm sorry, class of x, it is complex. Okay, I'm going to take a few questions.
Is there an order in which we should take the self-paced courses? I would say this is the order, because what I do is start teaching data science from the basics, to the best of my knowledge. So I think this is perfect to start with, because data science doesn't have much of a prerequisite. We will leverage elementary mathematics and knowledge about simple mean, median, mode, all those things which you used in your high school mathematics. We just leverage that knowledge; there is not much beyond it. So you can start with this, and on top of this you can start building up your knowledge if you want to learn big data and other stuff like visualization. But I think this is where you should start.
There is a question about Excel. Yes, Excel is a platform, but what I was trying to highlight was that you should know what a mean is. You know what an average, or mean, is; then why do we need a median? What is the need for even having a concept called median? So I'm talking about the basic stats; that is where we should start.
Do we need to brush up elementary maths, and if so, from where can we do it? Nikhil is asking. Okay, don't worry about brushing it up now. If, during the sessions, you realize that something is going over your head and I am talking at a really high level, then do that. But I make sure what we learn is at the very basic level, so that any school student can understand it. We are not going to do any rocket science here, just basic data, and we try to learn the art of how to leverage that data and how to treat it at a very basic level. So don't worry much about the mathematics; whatever mathematics is required, I am going to discuss in a little detail. Okay, Sai, you can declare strings with both single quotes and double
quotes. Theory and implementation go side by side. So the R files which you see on the screen, two of the three R files, are what we are going to discuss as part of the introduction to R, and then we'll move on to data visualization. We'll work with a data set which has rows and columns in a structure, and we will try to do some summarization, like sum, group by, and mean, all those things. So first we'll learn some data wrangling, then we'll learn some charts, and we'll come back to R to implement those charts. So we'll follow that sequence.
Which is more complex, R or Python? Python is more complex. So we can get started with the concepts, the syntax, and the semantics with R. We will not cover everything needed for core data science or data handling activities here, and that's fine. This is pretty important to start with; it will help you get started and on board with how you can leverage R for any of the basic statistical functions.
Okay, so the first one is data exploration, where we'll cover objects in R. Any programming language has some objects, in the form of placeholders where you can store your data. These all come from the traditional programming languages, and most of the objects are similar. So we'll go through objects, then flow control statements. Flow control is about things like if, case-when, or maybe for loops and while loops. You would not need loops a lot for statistical analysis, but they are a good thing to have if you want to automate your reports or automate your data science work; this helps you do that. Then we'll cover a few inbuilt functions and move on to how we can define our own user-defined functions. Of course, we will cover data manipulation, and if time permits, we'll also jump onto data visualization. So that's the agenda
for today. So, objects in R. You see this tree diagram; I think it's a pretty exhaustive one, which helps you understand all the objects. They are broadly classified into one-dimensional and multi-dimensional. What is one-dimensional? Say we just want to store the color of all the apples in your database: that's like one column, so one dimension. Then we can have multi-dimensional, where you want color and size and something else. Try to relate it to your data. Under one-dimensional we have homogeneous and heterogeneous, and the same for multi-dimensional: homogeneous and heterogeneous. I'll explain what homogeneous means in a moment.
It's an open-source world. There are times when a few functions may not work in the latest version and may work in an older version, and vice versa. So what companies do is stick to one version of R which is recent but may not be the very latest; it could be one or two months old, for which everything is tested and all the packages are stable. You need not worry about that, but that is what happens in the industry. So maybe you install 3.6.1 and you stick to that for at least six months, unless you see a major release coming.
Okay, so homogeneous and heterogeneous. The homogeneous example is the vector, and the heterogeneous example is the list. In multi-dimensional we have the matrix, which is homogeneous, and the data frame, which is heterogeneous.
So what are these vectors? A vector is a linear object which contains homogeneous elements; it is a collection of values that all have the same data type. So, number one, it is a linear object, and second, all its values are of the same data type. If we look at c(1, 2, 3), all the contents are numeric. In the second one, c(TRUE, FALSE), all the contents are logical, either TRUE or FALSE. You cannot keep 1 and TRUE and 2 and FALSE mixed like that in one vector; if you try, R converts them all to one common type. That's how it is. Let me just go to the chat
window. So, the easiest way to check if R is working: once you open R, you'll get a similar console. Type something like x = 2 and then x, and you should get 2 if R is working. If R is not working, you might consider installing RStudio once again; but based on whatever you are getting, or whatever issue you're facing, you can write to support. The presentation is available for everyone; you can just log in to the portal and download it from there. I believe we just started, Krishna, so nothing was missed.
Someone is asking, "What do you mean by linear object?" Okay, so when I say linear object, as I said, it is one-dimensional. We don't have two dimensions to it: not rows and columns, just one row or one column, that's it, if you think about a flat file. Going back to the presentation: when I say linear object, that means either one row or one column. Think of a regular data set in any of the ERPs or back-end systems. What does it look like? It has rows and columns, right? Say you have a customer table: each row represents one customer, and each column is a customer attribute: the customer ID, the customer's age, customer income, and so on and so forth. So when we say linear object, you can think of just one column, say customer age, and that is what a vector can represent. It cannot mix anything; it's just one single column with only one data type.
Next, homogeneous: c(1, 2, 3), c(TRUE, FALSE), or, in terms of age, c(20, 30, 40). Look, most of the operations and most of the analysis you're going to do will be on something called a data frame. We do not even use vectors much for analysis. But it is good to know all the data objects which are available in R, even though most of what we will be doing is on top of a data frame.
Sai is asking: if only one row is present, can we say it is linear, because the data set has
only one row or one column? If one row is present, it is one row, okay, so there is no confusion. Any example of where we can use a vector? Not really now; we are only introducing these terms. But vectors can definitely be leveraged when you want to automate. I will give you a scenario and not show the solution for now. Say you want to impute the missing values: you have a thousand columns in a data frame, and you want to fill all the missing values with some number, say zero for now (although there are different strategies for filling missing values). When you want to do that, you can leverage vectors.
Okay, fine. Creating a vector is pretty simple: c is the function. If you want to create a numeric vector, you just use c and then 1, 2, 3, 4, 5. Every element separated by a comma is one element.
Prakash, impute means replacing a value with something. Let's say you have a lot of missing values, nulls or NAs, and you would like to replace all the NAs with zero. That is called imputing. More on that later; that's a simple example, and that's where you can apply a vector.
So, creating a vector is pretty simple: you use the function c, and within the brackets you pass the elements in a comma-separated format. This is one way of doing it, the most basic way. The second way, if you want a sequence: you assign to num1 c(10:20). What this will do is create a sequence over the range 10 to 20: 10, 11, 12, and so on till 20.
If you want to go with a character vector, remember it's homogeneous: you cannot have a number in the same vector, so all the elements have to be character. The format remains the same: c("a", "b", "c"). As it is character, it is important to enclose it with double
quotes. You can also extend this to words: anything enclosed within double quotes is a character. It can be one letter or a combination of letters, like a string. So the second vector is char2, assigned c("this", "is", "Sparta"): "this" is the first element, "is" is the second element, and "Sparta" is the third element.
So guys, c here stands for combine; if you read the documentation, c combines values into a vector. And the operator which you see here is an arrow, which is the assignment operator; it's just like equal to. You can either use the equal sign or an arrow pointing towards the vector name. char2 is the vector name, and this arrow is pointing towards the vector name.
So let's really create these vectors; I have similar examples here. Look at it: vec1 <- c(1, 2, 3, 4, 5, 6). You can either write it here, or copy this command and run it in your console. This will create a vector with the elements 1, 2, 3, 4, 5, and 6. And how do you check it? Yesterday, I believe, we learned this: class, class of vec1. What happens is, when you're working in R and you have, say, a complex code of a thousand lines, and you want to edit it or do some enhancements, it's not like you'll always get finished code. So this is how you debug and check each and every object in R: you use the class command. Here it says numeric; as soon as it says numeric, you should understand it's a numeric vector. We will look at what it denotes for other objects: for a list it will give you list, for a matrix it will give you matrix, and for a data frame, data.frame. We'll check that in some time.
I'm running this line now: num1 <- c(10:20). What is num1 now? You can check here: 10, 11, 12, 13, 14, 15, till 20.
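The vector-creation steps described above can be sketched in R roughly as follows. The object names vec1, num1, and char2 are taken from the recording as best they can be made out, so treat them as assumptions rather than the exact names on screen.

```r
# Numeric vector: every comma-separated value is one element
vec1 <- c(1, 2, 3, 4, 5, 6)

# Sequence from 10 to 20 built with the colon operator
num1 <- c(10:20)

# Character vector: each element is enclosed in double quotes
char2 <- c("this", "is", "Sparta")

# class() reports what kind of values an object holds
class(vec1)   # "numeric"
class(num1)   # "integer" (a colon sequence of whole numbers)
class(char2)  # "character"

# Typing a name on its own prints the contents, which is handy for debugging
num1
```

Typing the object name directly in the console, as in the last line, checks the contents without adding anything to the script file.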
What I'm doing here, if you observe: the code snippets which I want to retain are written in this file. If I only want to do some debugging, I don't want num1 written in the file like this; I just want to check what num1 contains. So I skip the file and type directly in the console to check. This gives you the same output as running it from the file; the only difference is that this code will not be retained. It's only for your debugging.
num1 is not an array, guys; num1 is still a vector. And the equal sign and the arrow sign can be used interchangeably for an assignment like this, to the best of my knowledge. The only advantage of the arrow sign is this: you can type the contents of the vector first, whatever you want, and then change the arrow direction and say this is my num1. It just gives you the comfort of choosing where to place the definition and where to give the name, with the contents on the left.
Sai, if you're getting some errors, I'm not sure what you're getting... yes, "unexpected comma" between values. You just need to copy this command; you should have the file with you. Be careful with how many commas you have and how many elements you have. But guys, I recommend that you do not practice while we do this class. I really don't recommend that, because you waste a lot of time. These are pretty simple functions. Don't feel insecure; you can always practice at any time. It is a very, very simple file, so don't feel insecure about the syntax. The syntax is pretty easy; R is a very easy-to-use language. You can see these are simple functions: run them one time and you will remember.
Did we create the character vector? We didn't. Okay, that's fine; I need not create it. So this is character. Next one is creating a logical vector. It is very similar: you just need to say c of TRUE, FALSE, TRUE,
FALSE, whatever you want. TRUE represents 1 and FALSE represents 0, and these are keywords. You cannot write "TREU" or something; it has to be T-R-U-E. And T, F, T, F also represents TRUE and FALSE; T is a short form of TRUE.
One thing we can check quickly is that I am able to create these vectors. So I have one vector which says TRUE, FALSE... oh, I'm sorry, I just messed it up. Okay, so vec2 has TRUE and FALSE, two elements. Now you can ask me, where can you use this TRUE and FALSE? You can leverage this for any kind of automation. Say you are attempting a very complex programming challenge where you have to automate all your data manipulation; this TRUE/FALSE will really help. You may want to store, say, whether a value is null or not for all the records; you can store that in a vector and then pass it on to some other system. These are the scenarios, but I cannot define a specific scenario now. Once you understand what kind of tasks you have in analytics, it will be easier for me to explain where you can use a vector to automate your data science and data wrangling work.
Now, if you want to impute your data: impute means replacing an element. Let's say your vector has some null values. In any database we have something called null: null means nothing, a missing value or a blank. In R, a missing value like that is represented by NA. Let's say we create a vector which has a few NAs. If you look at it, the vector has NA, 1, 2, NA, 3, 5, NA: it has seven elements, and three are missing values. How do you tell? It takes very long to check by eye if it's a huge vector. So you can just say is.na; it returns a logical vector: TRUE, FALSE, FALSE, TRUE, and so on. TRUE means yes, the value is NA; FALSE means the value is not NA.
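A minimal sketch of the logical vector and the is.na check described above. The values of the vector with missing entries come from the session; the names short_form and with_na are hypothetical, chosen only for this illustration.

```r
# Logical vector; T and F are shorthand for TRUE and FALSE
vec2 <- c(TRUE, FALSE)
short_form <- c(T, F)

# A vector with missing values, written as NA in R
with_na <- c(NA, 1, 2, NA, 3, 5, NA)

# is.na() returns one TRUE/FALSE per element
missing_flags <- is.na(with_na)
missing_flags        # TRUE FALSE FALSE TRUE FALSE FALSE TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the missing values
sum(missing_flags)   # 3
```

Counting with sum() is one quick way to see how many values are missing in a vector too long to inspect by eye.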
This function, is.na, will stay with you for a long time. It is helpful in imputing missing values, which is one of the most important data science activities, so we'll spend some time on this topic.
Now let's say you want to impute, replacing all these NAs with zero. Just use a simple if-else: ifelse is a function, again an inbuilt function. You write ifelse, and within brackets we have three parameters: one, two, and three. The first one is the condition, is.na. The second one is what you want to impute with if the condition is satisfied, and the third is what you want if it is not. That's the way you write the statement. Let me run this and check the vector: 0, 1, 2, 0, 3, 5, 0. All the NAs are imputed with zero.
And the arrows are just for direction. If you write the definition first, you use the second option, and if you have the vector name first, you use the first one. It just gives the direction of what to process first and where to assign the result. The processing part happens first, and then the result is assigned to some variable, so either way is the same. So that's this basic data object in R. Moving on.
You can also find the length: if you say length of the vector name, you get the number of elements in the vector. There are a few functions you can try yourself; they're pretty straightforward.
Accessing the elements of a vector: how do you access just one of them? You give the vector name and, within square brackets, which element you want to access: 1, 2, 3. Let me find it; I think I have it somewhere else... I created a few vectors. So, length of num1: here num1 has five elements, so we get 5. Then length of char1: char1 has three elements, so it gives us 3.
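The imputation step described above might look like this in R. It is a sketch under the assumption that the vector holds the values read out in the session; the names with_na and imputed are hypothetical.

```r
with_na <- c(NA, 1, 2, NA, 3, 5, NA)

# ifelse(condition, value when TRUE, value when FALSE), applied element by element
imputed <- ifelse(is.na(with_na), 0, with_na)
imputed           # 0 1 2 0 3 5 0

# length() gives the number of elements
length(imputed)   # 7

# Square brackets pick an element by position, counting from 1
imputed[2]        # 1
```

Replacing with zero is only the placeholder strategy mentioned in the session; the third argument keeps every non-missing value unchanged.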
Then, if you want to access the second element of char2 (this is char2, right), all you have to do is say char2 and, within square brackets, pass 2. This will give you "is", the second element. If you want the first element, you just say char2[1]. Additionally, if you want more than one element returned, for example the first and third elements of my logical vector, what do we do? It basically expects a vector of 1 and 3, which means we want the first and third elements: TRUE and TRUE, first and third. Indexing starts with 1 here, you'll observe: when we say char2[1], it gives me "this". Okay, I think we have seen this, imputing missing values.
Next one is the list. A list is a linear object which contains heterogeneous elements. A list allows you to gather a variety of objects under one name. A list may contain a combination of vectors, matrices, data frames, and even other lists, so it's pretty powerful. What this is trying to say: it's again a linear object, although I would say it's not strictly linear; we'll see how. It is one single list, and it can contain heterogeneous elements as well. The syntax is that you just say list and, within brackets, say a number, 1, and then a string, "Sparta". These are two different data types, but they can fit into one single list; they can be packaged, or zipped, into one single list. That's the advantage of having a list: you can have multiple data types in one single object.
Now what does the last sentence mean, that it may contain a combination of vectors, matrices, data frames, and even other lists? It means that within a list you can have, let's say, one number and one string, and you can also have other lists, matrices, and data frames. So the first element could be just a vector of length one, the second element could be a vector of length one, the third element could be a whole data frame, meaning a thousand
rows and n columns, and the fourth element could be a matrix. So that's the beauty of the list. How we apply it depends on the requirement. Frankly, I have not used such a complex list so far, because most of the operations are on data frames, but you can think of some way to leverage the most complex form of the list.
So, creating a list uses the list function: you just say list and, within the brackets, pass the elements. You can observe here that in the first list all the elements have the same length: 1, "a", and TRUE. These are different types (the first one is numeric, the second one is character, the third one is logical), but all have the same length. Same for list two: we have list(c(1, 2), c("a", "b"), c(TRUE, FALSE)), all of the same length; there are two elements in every element of the list.
Accessing list elements: within double brackets you give the element number, whichever element you want to access. So the second element is mylist1, then 2 within double brackets. Now, because we have multiple levels in a list, where within an element I can have multiple elements, I might want to access an element within an element of a list. For that we combine the double bracket with a single bracket. Say I want the second element of the third element in the list: I say mylist2, then 3 within double brackets, and then 2 within single brackets. This will give me FALSE: in the third element, the second element is FALSE.
Okay, let us execute these; let me see if I have this handy. Yes. mylist1: you can always print it for your reference: 1, "a", and TRUE. And then mylist2: here I have tweaked the lengths; the third element has three sub-elements. You can do that too; the elements need not always be of the same length. The first element has two elements, the second element has two elements, and the third one has three. And you'll observe I am using the c syntax within a list, so within a list we can have a vector.
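The list creation and bracket access described above can be sketched as below. The names mylist1 and mylist2 follow the recording as best they can be made out, and the third value in the last vector is an assumption, since the session only says that the element was extended to three sub-elements.

```r
# A list can hold elements of different types
mylist1 <- list(1, "a", TRUE)

# Each element can itself be a vector, and the lengths may differ
mylist2 <- list(c(1, 2), c("a", "b"), c(TRUE, FALSE, TRUE))

# Double brackets pull out one element of the list
mylist1[[2]]               # "a"

# Double brackets, then single brackets: an element within an element
mylist2[[3]][2]            # FALSE
mylist2[[2]][1]            # "a"

# Number of sub-elements in each element of mylist2
sapply(mylist2, length)    # 2 2 3
```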
These are all vectors. So, if you want the second element of list one: what is the second element of mylist1? Just to make it more interactive. Is the second element 2? It's "a", right; the second element is "a". So if we do [[2]] of mylist1, this will be "a".
Okay, let's tweak it a little bit. I'll say mylist2, then 2 and 1. What am I doing, and what is the expected output, if you guys follow me? I am accessing mylist2: I passed 2 within double brackets and 1 within single brackets. Okay, let's see. Perfect, so this is what we expected. Now we'll go back.
We can also name the elements of a list. When the list gets complex and you want to make it more user-friendly to access, you can give a label to every element. The syntax almost remains the same: we can say list, and say the first element is 85, the second is 45, the third is 100. If you want to specifically provide a label, you can say apple is 85, most likely the price of apple per kg, and likewise banana per kg and the third fruit per kg. So the way to label the individual elements is: list, then label equal to element, label equal to element, like that. What this does is make it easy to access the list elements: you just say the list name, then dollar, then the element label, and you get 85. So I created this list and I can access 85. If you want banana, which has a label, just do the same and replace apple with banana; the output should be 45.
Let me just scroll through the chat window to check whether there are some queries. Nanda is asking, "Is an object always associated with storage only?" Yes, it's about storage and how you store it; I'm not sure what else it could be for. It's for how you store the data, and then you can leverage it while data wrangling. So, yes.
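The labeled list described above might be written like this. The list name fruits and the label third_fruit are hypothetical, since the name of the third fruit cannot be made out in the recording; the prices 85, 45, and 100 are the ones read out in the session.

```r
# Each element gets a label, written as label = value
fruits <- list(apple = 85, banana = 45, third_fruit = 100)

# The dollar sign accesses an element by its label
fruits$apple     # 85
fruits$banana    # 45

# names() lists all the labels
names(fruits)    # "apple" "banana" "third_fruit"

# Position-based access still works on a labeled list
fruits[[1]]      # 85
```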
Then, we can define an element with just data within an object; we can do that. A linear object is like one row of data; it's one-dimensional. If I happen to miss some question, please copy and paste it in the chat window and put it once again.
Yes, dollar is used to get the value when you have a labeled list, only in the case of a labeled list. "The names of the elements have to be unique": they should be, yes (R does not force this, but with duplicate names the dollar returns only the first match). If you want to use the names within the list, you have to use the dollar operator, like the list name, dollar, banana. It is very difficult to relate this to real data sets. The only object which is close to real data sets is the data frame, which is what we work with most of the time, but it's my job to explain all of the data types before jumping on to the data frame.
Okay, yes: c(1, 2) is given so that we can return multiple elements in one go. If I just say mylog[1], it will give me the first element; mylog[2] will give me the second element. But if I want both the first and the second, I have to pass c(1, 2), and c means combine into a vector, so it is a vector of all the indices you want.
And yes, there are always guidelines for naming variables; you need to follow proper guidelines based on the domain in which you are working. These are just examples, but when we work with real data you'll see how it works. Can you try the duplicate-name one at your end? I really need to cover other things, so you can try, and let me know if you get some error; if we do, we learn something new.
Okay, fine. Remember, these are all, I would say, just the introduction to these objects. Most of the time, I would say 99% of the time, you'll be working with a data frame, so that is where we will spend a lot of time. There are a lot of objects which are seldom used; the matrix is one, for that matter. The next one is matrix. I don't remember having used it any time in the past, not until you go for some recommender
systems, where you only have one single data type, numbers, and you convert everything to a vector. For example, with movies: we have thousands of movies, and every movie has to be represented as a vector in the form of numbers. That may be where this will be useful, but even that can be done using a data frame. The data frame is a one-stop shop for you: all the data manipulations can be done using a data frame.
But let's see what the matrix is. A matrix is a 2D object which contains homogeneous elements; two-dimensional, meaning rows and columns. If you see the output here, we have rows and columns: two rows and four columns. The easiest way to create it is matrix of c(1:8) with nrow equal to 2: we give a sequence, 1 to 8, as the elements (1, 2, 3, 4, 5, 6, 7, 8), and we want it in the form of two rows, so it automatically splits your data into two rows.
So, creating a matrix: the syntax is that you just say matrix. For creating a vector you have c, right; c represents creating a vector. How do you create a list? You just say list and, within the brackets, pass the elements. How do you create a matrix? You just say matrix and, within the brackets, the elements. But because a matrix is a two-dimensional object, we need to be careful with how we want the matrix to be laid out. The first parameter says what elements you want. Let's say I want 1, 2, 3, 4, which is a vector; wherever you see this c, it means it is creating a vector. So this vector of elements, 1, 2, 3, 4, I want in my matrix. And how do I want it? I want two rows. Automatically, if you have four elements and two rows, you'll have two columns. The third one is byrow equal to TRUE: that means assign the elements to the placeholders filling the rows first, so 1, 2 and then 3, 4. Then, a slight variation: you
can say c("a", "b", "c", "d"): again we have four elements, and nrow equal to 2. Here we don't have any variation; it's only filling characters instead of numbers. Then, if we do byrow equal to FALSE, it will fill the first column first, "a" and "b", and then "c" and "d". That's the only difference. You can also do it for logical elements, T, F, T, F, and you get the same kind of output.
Now, how do you access the matrix? You need to pass two things here: the row number and the column number. If you say mat1[2, 1], 2 represents the row number and 1 represents the column number. So the first one is the row number, the second the column number. What happens if you just say mat1[1, ], with 1, comma, nothing? That means the first row and all the columns. And if you say mat1[, 1], with nothing, comma, 1, that means all the rows and the first column. So these are like coordinates, x and y coordinates, to access different elements using these two parameters: matrix name and, within square brackets, row number, column number.
If you remember the transpose: if you want to exchange the rows and columns in a matrix, you can use the transpose function. mat1 is 1, 2, 3, and 4, so let us see its transpose: you just say t(mat1). What will this give? See, 1 and 4 will remain the same, because they are at (1, 1) and (2, 2), so exchanging the row and column numbers changes nothing. But the other two elements, 3 and 2, swap: 3 was at (2, 1) and 2 was at (1, 2), so they just exchange. All the elements except the diagonal elements get exchanged.
Numbers, yes, but not a combination of numbers and other types. If I go back to the slide, a matrix is homogeneous, and whenever we have homogeneous, only one data type is allowed. That's where you can use a data frame, which is more realistic. Okay, I'm just going through the chat: "We did not get byrow." Okay, we'll see that now. Let me open up and get this matrix.
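The matrix steps described above can be sketched as follows, assuming mat1 is filled by row as on the slide. The names mat1 and mat2 follow the recording as best they can be made out.

```r
# Four elements in two rows, filled row by row
mat1 <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
mat1
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4

# byrow = FALSE fills the first column first, then the next one
mat2 <- matrix(c("a", "b", "c", "d"), nrow = 2, byrow = FALSE)
mat2[2, 1]   # "b"

# Access: row number first, column number second
mat1[2, 1]   # 3
mat1[1, ]    # first row, all columns: 1 2
mat1[, 1]    # all rows, first column: 1 3

# Transpose: the diagonal stays put, the off-diagonal elements swap
t(mat1)
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
```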
So this is my mat1. We passed four elements, 1, 2, 3, 4, to the matrix function, we said we want two rows, and we also said byrow equal to TRUE. What this did: it filled the elements 1, 2, 3, 4 by row. Row first: fill everything in the first row, then go to the second row. Let me try byrow equal to FALSE. So it did work, and what is happening now? We are not filling it by row; we're filling it by column first. It fills these elements in sequence: fill the first column, 1, 2, then go to the next column, 3 and 4.
Did we miss any question? Manish is asking, "You do not use c in the fruit list?" Yes, the fruit list is the labeled list; there c is not required, because each element there is a single value with its own label. So that's the only difference.
And don't worry about remembering the syntax. Data science is not about memorizing syntax; don't worry about the coding. What you guys can do is leverage this PDF, or this R file, as, I would say, a ready reference for syntax if you don't remember something. That's all you should use it for; don't try to learn it by heart. Nobody expects that. If you go for interviews, in a one-hour interview only a part will be coding; they just want to know whether you know R and can leverage all these things, that's it. Nobody's expecting you to do a lot of coding.
"When the data frame is available, why would one need a matrix?" I can give an example. A matrix is faster when you have homogeneous elements and when you want to do something like cosine similarity. You might not know what cosine similarity is; if you know, well and good. Things like vector multiplication, I mean multiplication of two matrices, are easier when you have a matrix object instead of a data frame. A data frame is much more complicated in terms of how it stores data in the backend, right,
because it has to be in the form of rows and columns, with the columns being heterogeneous: they're not all alike; one column will be a number, another column could be a date. So what I'm trying to say is that a matrix will be useful when you want to do a lot of mathematical operations on bigger matrices. Remember the mathematics where you would do matrix multiplications. It is difficult to explain now; once you understand how a recommender system works and how cosine similarity is used, I can give an example of how a matrix is leveraged.
My second answer to this question: this programming language has evolved, right; it has to have all the objects, and then it comes to the data frame. The data frame is the most mature object. Initially all these things came along; not all of them may be used, but they are there in the programming language. The data frame is enough for most operations. As a programmer, if you have something in your armory, you have to check how you can really leverage it. One thing I remember is that a matrix gives you better performance in a homogeneous kind of scenario, where you have to multiply two matrices.
About cosine similarity, I'll give you a little bit of detail. Again, the movie example: every movie can be represented in the form of a matrix. When I say matrix, it will be in the form of features. A movie could be from a genre, a movie has a length (how long is the movie), a movie has some actors, or whatever. All these features of a movie can be turned into some numbers, and that will be nothing but a matrix. If you want to understand the similarity between two movies, that is nothing but finding the similarity between two matrices. If you go by the geometry and calculus we learnt in high school, you can multiply these two matrices and find one single number, which will give you the angle between the two. It might go over
your head for now, but bear with me; I am giving a high-level example. That angle will tell you how different these two movies are, based on all the features which are embedded in the matrix. In these cases a matrix might be a little faster compared to multiplying two data frames. And by the way, no need to remember all the syntax, not at all.
Someone is asking: if we set byrow equal to FALSE, doesn't it already look like a transposed matrix, so why even need a transpose? Okay: once you have already filled a matrix with, say, a thousand rows and a thousand columns, you do not refill the matrix. byrow applies only while filling the matrix the first time. After that, if you want to transpose it, that's where transpose is used. And by the way, transpose has a lot of other applications, not only this one. You might need the transpose to find matrix inverses, if you go with linear algebra concepts. Say matrix A multiplied by X gives me a matrix B; if you want to find the matrix X, that's where you will need a lot of transpose operations. More on that later. If you want to go deep, you can start learning linear algebra concepts, and then linear programming concepts on top of matrices. So transpose is a useful function for that.
Okay, did we create this one? Yes, matrix 1 is created. We also created matrix 2, which is all characters: "a", "b", "c", and "d". Then matrix 3, where all of the elements are logical: TRUE, FALSE, TRUE, FALSE. You can print them: mat1, mat2, then mat3. Then you can access the elements. If you want the first row, all columns, you can say mat1[1, ], and you get 1 and 3. Then all the rows, first column: you get 1 and 2. Then second row and first column, mat1[2, 1]. And if we do a transpose, it gives us 1, 2, 3, and 4: the diagonal elements remain the same, 1 and 4; only the non-diagonal elements change.
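The session does not show the cosine similarity calculation itself, so the following is only an illustrative sketch using the standard definition (dot product divided by the product of the two vector lengths). The feature values and the names movie_a, movie_b, and cosine_similarity are hypothetical.

```r
# Hypothetical numeric features for two movies:
# genre code, length in minutes, number of lead actors
movie_a <- c(1, 120, 3)
movie_b <- c(1, 95, 2)

# Cosine similarity: dot product over the product of the vector lengths
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

sim <- cosine_similarity(movie_a, movie_b)

# The similarity is the cosine of the angle between the two vectors;
# pmin() guards against rounding slightly above 1 before acos()
angle_degrees <- acos(pmin(sim, 1)) * 180 / pi

sim
angle_degrees
```

A value close to 1 (a small angle) means the two feature vectors point in nearly the same direction, which is the sense in which the angle tells how different the two movies are.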
And if you want to find the average of, say, all the rows but only the second column: what do we have in the second column of mat1? The second column is 3 and 4. So what this function is saying is mean of mat1[, 2]. What is there in mat1[, 2]? We have 3 and 4: all the rows, only the second column. And what would the mean be? The average of 3 and 4, which is 3.5: 3 plus 4 is 7, and 7 by 2 is 3.5. The same thing can be done for a row: mean of mat1[2, ], which is 3.
So observe that the original matrix and the transposed matrix just differ by reversing the coordinates, row and column, that's it. Now for the non-square matrix; I think I just executed it here. What I mean by a non-square matrix is that the row and column lengths are not equal: we have two rows and three columns here. On top of that, if we do a transpose, t of the matrix, what it does is just exchange the coordinates. For example, for the element 1, what was the row number and column number? It was (1, 1), so if we exchange them it is again (1, 1). How about 3? 3 was at (1, 2), so it will become (2, 1). See, (2, 1), right? How about 5? 5 is at (1, 3) in the original matrix; if you interchange, it will be (3, 1). Similarly for all of the elements. This is what we meant by the non-square matrix.
In linear algebra you have a lot of theory around this concept, and around how to find the transpose, the inverse, and eigenvalues. I'm not going into details now, but those are good-to-have skills in the long run. I'm not asking you to learn them now, but revisit those concepts later; we'll talk about a few items and the next steps you can follow out of this course.
"How to view the top-left window?" Abdul, okay: you can go to the File menu here and say New File, R Script, and start typing there.
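The column mean and the non-square transpose described above can be sketched as below. It assumes mat1 in its column-filled form (second column 3 and 4), as in the session at this point; the names wide and tall are hypothetical.

```r
# mat1 filled by column (the default, byrow = FALSE)
mat1 <- matrix(c(1, 2, 3, 4), nrow = 2)

mean(mat1[, 2])   # second column is 3 and 4, so 3.5
mean(mat1[2, ])   # second row is 2 and 4, so 3

# A non-square matrix: two rows and three columns
wide <- matrix(1:6, nrow = 2)
wide
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# Transposing swaps the row and column coordinates of every element
tall <- t(wide)
dim(tall)     # 3 2
wide[1, 3]    # 5
tall[3, 1]    # 5, the same element at the swapped coordinates
```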
You can also use the plus icon that you have here and choose R Script. By the way, RStudio has also started supporting Python, C++, and other files, but we will restrict ourselves to R for now.
Again, I'll move on to data frames instead of learning arrays for now. The next one is the data frame, and then I will talk about arrays. So let us first look at data frames. What is a data frame? I am skipping the array concept; again, it is not very important, and I first want to cover data frames. A data frame is a 2D table; again, it's only two-dimensional, remember. What was the dimension of a matrix: 1D, 2D, 3D? It is 2D, right: rows and columns. Similarly, a data frame is also a 2D table, where each column comprises homogeneous elements and each row contains either homogeneous or heterogeneous elements. Meaning, it is the spreadsheet data that you see in real life, right?
So the basic syntax for creating a data frame is data.frame(), and then you see the first column name: Name = c("Sam", "Bob"), so this is what we'll have in the Name column. Then, comma, Age = c(32, 48). This gives this result: Name is a column name, Age is a column name, with Sam and Bob, and 32 and 48, two elements each. One element from the Name column and one element from the Age column make one row: Sam, 32 is one row; Bob, 48 is another.
Now, how do you access the columns? Let us say you have a data frame with several different columns and you want the data for one column. It is the data frame name, dollar, column name. This is like the labels in a list, where we used the dollar sign; here, column names act like labels.
So let's see if we can create a data frame now. We have this data frame with Sam and Bob. We created it, but I cannot access it because I have not assigned it, right? So let's say test_df, I'm giving it a name, is equal to data.frame(...). So this is my test_df. I can print it completely: Name, Age; Sam, Bob; 32, 48, right? Now, to access a column, we use test_df and the dollar sign.
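The data frame creation and the dollar-sign access described above can be sketched like this. The spellings test_df, Name, and Age are read off the spoken description, so treat them as assumptions rather than the exact on-screen names.

```r
# Assumed names: test_df, Name, Age (taken from the spoken description).
test_df <- data.frame(
  Name = c("Sam", "Bob"),
  Age  = c(32, 48),
  stringsAsFactors = FALSE  # keep Name as plain character strings
)

print(test_df)  # one row per person, one column per attribute

# The dollar sign pulls a single column out as a vector.
names_col <- test_df$Name
ages_col  <- test_df$Age

print(names_col)
print(ages_col)
```

Setting stringsAsFactors explicitly is a deliberate choice here: the default differs between older and newer versions of R, and spelling it out makes the sketch behave the same on both.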
Two options come up: let's say I go for Name, which gives me Sam and Bob, or I can go for Age.
Can we create a cube? Someone is asking. A cube is three-dimensional, yes. To create a cube-like structure you can use arrays, which are not covered now. And not only a cube: you can have three, four, even a thousand dimensions, multi-dimensional objects. Now, where would you need that? What I will do is create a requirement for these things: where would you need a cubical structure? It's not very common, but yes, we'll talk about that; it's interesting. You can do it using arrays.
Now, with data frames, you will generally not create all the data here, right? You will not do it inline, because data frames could be huge; you could have millions of rows. So one way to do it is to import the data from an external CSV file or other data files. What is a CSV file? I am coming to that. And can we extract rows? Yes, I mean, we can extract rows; we'll see that now. So what is a CSV file? It is a comma-separated file: each column is separated by a comma. I have a file in the back end, and I am just passing it in, giving the path and the file name, and then using the read.csv() function. This is the most common file format used in the data world; CSV is available everywhere, and the reason is that it takes very little space.
So I have imported my file, and I want to check what is in this file. Let me see if the command works: View(customer_churn). It will be huge; it's not recommended to view the full data set, but I am taking that risk. In RStudio, if you use this command, View(customer_churn), you'll get this kind of structured data format, and this is what we have imported. We have the customer ID; gender; whether the customer is a senior citizen, zero or one; whether the customer has a partner, yes or no; whether the customer has any dependents; and the tenure of the customer.
That is the tenure in the system. Then there are all the services he or she has subscribed to: phone service, multiple lines, internet service, online security, and so on. So we have a lot of columns, and at the end we have a label, churn. Churn means the customer left and switched the network. No means the customer is still with the system; yes means the customer has left the system. OK, so this is how our data frame looks.
Now, I was just wondering: usually, once you import the data, you click on this and view the data set, but somehow that is not always happening. So instead of using the View command, you can also click on this customer_churn. This gives you a better view of the data. You cannot do this in basic R; there you have to go with the command. A GUI for looking at this data in such a structured format is not available there, and so we leverage RStudio.
Are you asking whether reading this file requires any module or package? No, this is there by default. But there are a lot of customized packages that can help you read data in a very short time frame, so they help enhance performance. Those customizations are available; you can browse if you wish to. But read.csv is the most common way of importing a file.
If I only want to look at gender, I can just say customer_churn$gender, and this gives me the gender of all the customers. That is not the best view of the data. It's better to go to this View window and look at the gender for every customer, because you want, I mean, the 360-degree view: you want to look at all the other columns along with the gender, in a spreadsheet-like format.
How do you access a column using the column number? Well, it's not always easy to access columns by column name. Let's say you want to plan an automation and you want to access the third column. You can also do that: you just give the data frame name and, within square brackets, blank, comma, three: customer_churn[, 3].
Blank means all the rows, and just like in a matrix, this is all the rows and the third column. This is what you get. So this is the default R output, and this other view is what you get as an add-on in RStudio. On the console, what you get is the way R works, and this is how RStudio gives it to you in a more structured way, so it's really easy to understand. And if you want, let's say, three columns, you can just say customer_churn[, c(1, 3, 6)]: all the rows, and c(1, 3, 6) gives you the first, third, and sixth columns of this data frame. If you see, we have the first, third, and sixth columns. I cannot scroll up; it has a lot of records.
Now, how do we see the number of records? That's the first thing you would like to do, right, as soon as you import the data. So what do you do? nrow() of the data frame name. We have 7,043 rows. And there is the dim command: dim() gives you the number of rows and columns as well. So it has 7,043 rows and 21 columns.
There are a lot of other things you can do. There is something called str(); str stands for structure: the structure of this data frame. So what is the str command telling you? You see a summary: we have 7,043 observations and 21 variables. We have all these variables, and for every variable we have a data type, such as int or Factor. What is a factor? We will discuss that. And for every variable we have some sample records: for gender we have Male and Female, for senior citizen we have 0 and 1. It doesn't show all the records; it is a summary, a high-level overview of what your data set looks like.
So let us move a little away from the syntax; it might be a little boring, right? What is a factor? I think we did not cover it. We know numeric, integer, then character, and then logical. So what is a factor, and what is the need for it? We will not cover it completely, but let's understand the need. We have discussed, right, that machines only understand numbers.
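The inspection commands used a moment ago (column selection by position, nrow, dim, and str) can be tried on a small stand-in data frame. The real file is not available here, so the column names and values below are hypothetical; only the shape of the calls follows the session.

```r
# Hypothetical stand-in for the imported data (not the real file's contents).
customer_churn <- data.frame(
  customerID     = c("A1", "B2", "C3", "D4"),
  gender         = factor(c("Male", "Female", "Female", "Male")),
  SeniorCitizen  = c(0, 1, 0, 0),
  Partner        = factor(c("Yes", "No", "No", "Yes")),
  tenure         = c(1, 34, 2, 45),
  MonthlyCharges = c(29.85, 56.95, 53.85, 42.30),
  stringsAsFactors = FALSE
)

third_col  <- customer_churn[, 3]           # blank before the comma = all rows
three_cols <- customer_churn[, c(1, 3, 6)]  # 1st, 3rd and 6th columns

n_records <- nrow(customer_churn)  # number of rows
shape     <- dim(customer_churn)   # number of rows, then number of columns

str(customer_churn)  # data type and a few sample values for every column
```

In the str() output, the columns created with factor() are reported as Factor, which is the data type discussed next.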
So take a categorical variable, like gender equal to male or female, right? How will the machine understand male and female? You could say, OK, why not give them some labels, like male is 1 and female is 2? What happens is, when you pass this to the machine while applying all these algorithms, in the back end it will internally assume, say, male is 1 and female is 2, which means females are by default better than males, right? Or vice versa: it could also take it as males being better than females; it can go both ways. But that's not true, and that's what we would be telling it if we pass on 1 and 2 as labels.
That is why we don't want to bring in that bias. We just want to say: these are two categories; we don't know which is good or bad, or who is performing better or worse. Maybe based on some other variables, like the average marks of the male and female students, it will decide based on some other column. But just by looking at this gender column, we don't want the machine to assign any weight.
So for that purpose this factor type was introduced, which basically means: these are two categories, and there is no ranking between these two categories. As for how it is done in the back end within the R system, it basically follows the one-hot encoding paradigm for all the categorical variables. Let us say I give gender, and I have male, female, female, then male, then female; I have five records. What this will do is create two variables, gender_male and gender_female, and pass on the numbers. If it's a male, gender_male is 1 and gender_female is 0; if it is a female, this is 0 and this is 1. Then 0 and 1 again; then 1 and 0; and then 0 and 1. Likewise for all the records. So these two columns will be passed to the machine, or the software, or whatever you are working on.
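The two indicator columns described above can be made visible with a short sketch. It rests on one clarification that is standard base R behaviour rather than something stated in the session: a factor itself stores integer codes plus level labels, and it is the modelling functions that expand it into 0/1 columns. model.matrix() is used here only to show that expansion; the column names it produces (genderFemale, genderMale) differ from the gender_male and gender_female labels used in the explanation.

```r
# The five example records from the explanation above.
gender <- factor(c("Male", "Female", "Female", "Male", "Female"))

print(levels(gender))      # the categories, with no ranking implied
print(as.integer(gender))  # the integer codes stored behind the labels

# model.matrix() expands the factor into 0/1 indicator columns.
# "- 1" removes the intercept so that every level gets its own column.
one_hot <- model.matrix(~ gender - 1)
print(one_hot)
```

Each row of one_hot has exactly one 1, marking the category of that record, which is the one-hot pattern walked through above.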
OK, so only these two are passed on to the machine. Now you can extend this to variables that have more categories. Instead of gender we could have some other variable; or even in gender we can have three categories, right, which is socially acceptable. If you have three categories, then we will have a third column here, and the ones and zeros are filled in accordingly. This is called one-hot encoding; it is also known as dummy variable creation for categorical variables. So what happens in R? You do not need to create this explicitly. As soon as you declare a variable as a factor, internally it is feeding these values.
So I will quickly address the next question: instead of always importing the data into an object, can you read and process it on the table directly? OK, so R doesn't have that functionality of operating on the table. It is an in-memory programming language: it always imports first and then operates. If you want to do that, there are Hive systems; you can call Hive, parallel processing systems, directly. We pass on all the functions from R, use it as a pipe, do the operation on the table, and come back. That facility is there; it is called piping. So whether you have only R syntax or only Python syntax, that piping can be done, right: write the syntax in R or Python, and then call it from a Hive system, which will do all the manipulations and pass the result back. That facility is there, but R doesn't have these things by default.
How can we ensure that enough RAM for processing is allocated for R to function? OK, good question. If you observed while installing RStudio, there was something called RStudio Server. This is a client-server architecture. In a simple scenario, R is installed on your local desktops and laptops. But when you work on projects and you are deploying your code, right, the code is always on a server; the server has R installed for you, and it follows a client-server architecture.
You have one centralized server with much more RAM, as an example, and all of the R clients connect to it: you as a user will use one R client, another developer will use another R client, and through the server you access the RAM. That's where that management happens, and RStudio Server is the place where you can do all these kinds of management. So don't worry about that; you are not going to work on any admin stuff. The admins will be able to handle it easily.
Can we get all the commands and shortcuts in one document for future reference and application? You have already got the material shared, and if required I can share some cheat sheet links. Those are readily available on the internet; you can download cheat sheets. For example, for the dplyr package, which we are going to learn in detail, you'll have a one-page cheat sheet, and that is, I think, all you would need for those kinds of activities. We are going to learn dplyr shortly.
Is there any data frame size limitation in R? Someone is asking, I mean, how big the data can be. That depends. Data frame size also depends on how many columns you have. There is no official limitation, but it depends on your RAM, as I said: how much RAM you have and how much data you need to process.
How did we import the data frame? We used the read.csv command, passed in the path of the CSV file, and then just used this assignment arrow. That's all: the data is loaded into this customer_churn data set; this is the data frame name. So if you do class() of customer_churn (remember I told you that every time you create an object, it's good practice to look at the class), it says data.frame. It's not a list or a vector, nor a matrix, right? It's a data frame.
And the size of the script? It is very small; it's in KBs only. Yes, and we are acting as clients when we follow that client-server architecture. Suppose a CSV file is the data set, as it is most of the time in projects.
So what will data engineers or other source application teams do? For example, you have a requirement where you want to mine how people are behaving in the app: when they are clicking, where they are hovering, whether they are going to a particular option and then stopping there, right, due to several reasons; it could be a bad UI or something else. All of this customer journey in the app or website is captured in tools like clickstream; I think there are a few such tools. So if I want to work on these systems, right, basically what you do is get access to those systems and start working. They will extract the data and give it to you, or share the data set in the form of a CSV, to do all the POCs. Once you're done with that, you'll have a data engineering and software engineering team who can help you plug and play this algorithm, the statistical model, into their systems.
Did I miss anything? OK, so as I said, what is a cube? We have two-dimensional data, rows and columns, but what is a cube? Take text data. As I said, take any text data, like a movie name. How will you represent a movie? First, you'll have, like, 1 million movies, and for every movie you will have certain characteristics, like the length of the movie, the actors who acted in the movie, or the genre of the movie, right, and a lot of other things. So far these are just two-dimensional: the movie name and then the different features. Now how does it become three-dimensional? If I convert each of these features into an array. Just the movie genre can itself be an array, say a two-dimensional array, since a matrix is two-dimensional. The movie genre can be split into: what is the genre type; within the genre, what is the sub-genre; and is the genre a mix of multiple genres? Just like that, every single word can be represented in the form of a matrix or array. So that's where it becomes three-dimensional.
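Arrays are not covered in this session, so the following is only a minimal sketch of what a cube-like, three-dimensional object looks like in base R. The dimensions and values are made up purely for illustration and have nothing to do with the movie example.

```r
# A 2 x 3 x 4 array: four 2 x 3 matrices stacked behind one another.
# The sizes and the values 1..24 are arbitrary illustration data.
cube <- array(1:24, dim = c(2, 3, 4))

cube_dims <- dim(cube)       # rows, columns, layers
slice_1   <- cube[, , 1]     # the first layer is an ordinary 2 x 3 matrix
value     <- cube[2, 3, 4]   # row 2, column 3, layer 4

print(cube_dims)
print(slice_1)
print(value)
```

Indexing works the same way as for a matrix, with one extra position for the third dimension.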
Beyond this data set, if you go one level deeper, it becomes three-dimensional, which you will realize in some time, after one or two weeks, right? Once we discuss a few data sets, you will realize how that happens and what the need for creating a three-dimensional structure is. Especially in the case of text mining and recommender systems, we will need more than two dimensions, where every data point is further split up. Whenever you want to compare two words, right, we need to know what hidden features we have in the word. An apple and a mango can be compared, and we can say these are very similar because both are fruits. But what about the context? Is Apple used in the context of a company, or is it a fruit? If it is used in a different context, then apple and orange are not very similar, right? These are the hidden features. So every word is also split into a matrix in the back end, further creating another dimension in this data. That is where we need three dimensions, four dimensions, and so on. You've got the intuition; more on that later.
I would say you just follow the file. What is the error you are getting for the CSV? You said it is giving some error. So make sure the path is correct, and watch the slash: never use the other slash. I don't know whether you call it back or forward, but this is the one you have to use. By default, if it is Windows, the slash will be different; it will be the other way round. So either you use this one, or you double the other one. If you use two slashes in Windows, that will also work. Just make sure you follow this syntax. And why are you using "cv"? I'm not sure why; it should be read.csv, not cv. OK, fine.
We'll go through some more things on data frames. When you have a data set, once you start exploring it, right, what did we see? We looked at the number of rows, the dimensions, and the structure, right?
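The import step and the path advice above can be sketched as follows. The actual file and its location are not given, so this sketch writes a tiny stand-in CSV to a temporary folder first; the file name, its contents, and the Windows paths in the comments are all hypothetical.

```r
# Write a small stand-in CSV so that the example runs anywhere.
# File name and contents are made up for illustration.
csv_path <- file.path(tempdir(), "customer_churn.csv")
writeLines(c("customerID,gender,tenure",
             "A1,Male,1",
             "B2,Female,34"), csv_path)

# On Windows, write the path with forward slashes or with doubled
# backslashes, for example (made-up paths):
#   "C:/data/customer_churn.csv"  or  "C:\\data\\customer_churn.csv"
customer_churn <- read.csv(csv_path)

obj_class <- class(customer_churn)  # expected to report a data frame
shape     <- dim(customer_churn)    # rows and columns that were read

print(obj_class)
print(shape)
```

A single backslash inside an R string starts an escape sequence, which is why it has to be either doubled or replaced with a forward slash.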
Then the other way to look at it, as I said: if the data is huge, then as soon as you import it you should not say View; it will give you the whole data set, and your system might get stuck and hang. So better go for head(): head will give you the top six records by default. If you want to customize, you can say head(data_frame, 10), and it will give you the top 10 records; head(data_frame, 2) gives two records, right? Same for tail(): tail will give you the bottom six records.
So let's say you don't have RStudio: this is how you start looking at data. Even in the case of RStudio, this thing might get stuck if you use View all the time on huge data. It's like a replica: this much data is already there in your RAM, and then when you click, it displays it here, right? So this interface will also consume a lot of space if you have so much data. Then you can do tail with 3, or 10, or 2, and so on.
Another important one, which I think I missed, is summary: you can just say summary() of the data frame. It is very useful, because initially, when you do data exploration and you have millions of records, these things are very handy: in one single shot you are able to see so many things. So what is summary giving us? As the name suggests, summary gives you a summary of what the data set is all about; it is basically giving you the summary statistics. Let's talk about one column; I will choose tenure. Under tenure it gives you the minimum tenure, the maximum, the first quartile, the median, the third quartile, and the mean. Don't worry about quartiles; we will learn about quartiles. But this gives you an overview of what the data is: the range (minimum and maximum value), the average value, the median, the first quartile, and the third quartile. In one single shot you get this view of this variable, and not only this variable but all the other variables. The difference is that for numerical variables you get these statistics in the form of min, max, mean, median, and so on.
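The head, tail, and summary calls described above can be tried on a small stand-in data frame. The data below is hypothetical (20 made-up customers); only the function calls follow the session.

```r
# Hypothetical stand-in data: 20 customers with a tenure and a gender column.
customer_churn <- data.frame(
  tenure = 1:20,
  gender = factor(rep(c("Female", "Male"), times = 10))
)

top_default <- head(customer_churn)      # first 6 rows by default
top_ten     <- head(customer_churn, 10)  # first 10 rows
bottom_two  <- tail(customer_churn, 2)   # last 2 rows

# Summary statistics for one numeric column, then for every column at once.
tenure_summary <- summary(customer_churn$tenure)
print(tenure_summary)
print(summary(customer_churn))
```

Because head and tail only return a few rows, they stay cheap even when the full data set is too large to open comfortably in a viewer.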
For categorical variables like gender, because a mean doesn't make any sense, and a minimum or maximum doesn't make any sense, it will give you the count of each category: we have 3,488 females and 3,555 males in our data set. These functions you should remember, or at least keep handy somewhere: summary, str, head, tail.
If you explicitly want to calculate the mean, min, max, all these things, right: if you pass these a vector, say max(c(1, 2, 3, 4, 5)), it will give 5. So if you do max of one column, customer_churn$MonthlyCharges, that will give you the max of that column. What does that mean? Can you observe an analogy? Every column in a data frame is nothing but a vector, so a data frame is a combination of multiple vectors. So max of the monthly charges gives us the customer with the highest bill; dollars or pounds, I'm not sure, but that is the maximum monthly charge in our data set.
To run this code, as I said yesterday, just bring the cursor anywhere on the line and press Ctrl+Enter. It will automatically detect whether it is multi-line code and run the complete statement. For sorting, rely on the dplyr package; the command is arrange, so we'll look at that. Merging, again, we'll look at in the dplyr package: merging and joining.
Let's try out the others. For the minimum we say min(): min of customer_churn$MonthlyCharges; 18.05 is the minimum monthly charge. Mean is the average; all of you know what an average is, right? I need not explain, but I'll cover mean, max, and median in a little more detail: how you can apply the average in different scenarios, and why we really have something called the median. If the average does everything, why do we have the median? The range is the minimum and the maximum; max minus min is the distance. Then there's something called sample.
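The point that every data frame column is just a vector can be checked with a short sketch. The values below are hypothetical, and the column name MonthlyCharges is assumed from the spoken description.

```r
# Hypothetical charges; the column name MonthlyCharges is an assumption.
customer_churn <- data.frame(
  MonthlyCharges = c(18.05, 29.85, 56.95, 70.70, 99.65)
)

charges <- customer_churn$MonthlyCharges  # a plain numeric vector

highest  <- max(charges)
lowest   <- min(charges)
average  <- mean(charges)
middle   <- median(charges)
low_high <- range(charges)   # minimum and maximum together
spread   <- diff(low_high)   # max minus min, the "distance"

# The same functions work on any vector typed in directly.
max_of_five <- max(c(1, 2, 3, 4, 5))

print(c(highest, lowest, average, middle, spread, max_of_five))
```

Note that range() returns both ends rather than their difference, so diff() is applied when the distance itself is wanted.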
Instead of doing head and tail, you can also do random sampling. It may not seem very interesting or important now, but we'll see how sampling becomes relevant when you do statistical modeling. Let's say you want to get five customer IDs at random, right? You say sample(customer_churn$customerID, 5), and it gives you five random records, five random customer IDs. Can we do it on the whole data? We can: you can do it either on one column or on all the columns. Sampling will be very, very important in the subsequent sessions; we'll see how.
Is it line-by-line code execution? Yes, it is line-by-line code execution.
Then there is another function called table(); it is a very handy function to summarize data. Say I call table(customer_churn$gender): this gives me a summary of how many records we have for female and male. So similar stats we can get using different functions. There are a lot of repetitions, I know, but you should know all these functions.
These are row bind and column bind: rbind and cbind. This is the start of understanding how you merge two data frames, or maybe add one column to a data frame; rbind and cbind come in handy there. So what is rbind? rbind stands for row bind, and row bind means: if you already have a data frame, how do you add a row to it? Or, if you have two data frames, how do you vertically stack the two data frames?
So let's create one data frame called student. What does student look like? Again, Sam and Bob, and 97 and 25, right: the name and the marks. How did we fill it? We just used the inline data.frame(name = c("Sam", "Bob"), marks = c(97, 25)). OK, fine. Now I am converting my name column to character; I can use as.character() for that. And now I want to add a row: one more student, whose marks are 75.
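Adding a row with rbind, and a column with cbind, can be sketched on the student data frame described above. The marks 97 and 25 are read off the transcript; the new student's name ("Ann") and the grade values are made up for illustration, and stringsAsFactors = FALSE is used here in place of the as.character() conversion so that the name column is character from the start.

```r
# The student data frame; column names follow the spoken description.
student <- data.frame(
  name  = c("Sam", "Bob"),
  marks = c(97, 25),
  stringsAsFactors = FALSE  # keep name as character, not factor
)

# rbind(): stack a new record underneath. "Ann" is a hypothetical name.
new_student <- data.frame(name = "Ann", marks = 75, stringsAsFactors = FALSE)
student <- rbind(student, new_student)

# cbind(): attach one more column side by side. Grade values are made up.
student <- cbind(student, grade = c("A", "C", "B"))

print(student)
```

rbind expects the new record to have the same columns as the existing data frame, and cbind expects the new column to have one value per existing row.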
So all I have to do is rbind: rbind data frame 1 and the new record, right? If I do this, I get Sam, Bob, and the new student, right? Instead of this one record, you could also add one complete data frame. This data frame 2 might have more than one record; it's like vertically stacking two data frames, and we can use rbind for it. We will look at a bigger example where we stack two data frames. Similarly, if we have one student data frame and I want to add one more column, that is, grade: in that case it is horizontal stacking, right? This was my data, name and marks, and I want to add one more column, grade. I can say cbind: data frame 1 and the new column.
Someone is asking: if we change the original data in the spreadsheet and save it, does it get updated here? No; what we have imported stays the same in memory. And as.character only converts the data type of name to character. Why do I use it? Let me create this student data frame and look at the data types with str(), and see what is happening. Name is becoming a factor, although we didn't want name as a factor, because, see, you don't want to categorize someone by name, right? If someone is male or female, you can differentiate by that, but a name is just a character. So I want it to be character by default. More on that later; we'll discuss it in detail. I already explained the relevance of a factor, but we don't want name as a factor, so we just do that.
Just a quick info, guys: if you are interested in doing an end-to-end certification course in Data Science with R, then Intellipaat provides just the right course for you. You can find the course link in the description box below. So this brings us to the end of this video. If you have any queries, please do comment below; we would love to help you. And also, do subscribe to our channel so that you don't miss out on our upcoming videos.