Data in R can be held in a wide variety of objects, such as vectors, matrices, arrays, data frames, and lists. Let’s see how to create a data frame in R.
What is a data frame?
A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, etc.). It’s similar to the datasets you typically see and use in statistical packages like IBM SPSS, SAS, and Stata. Data frames are the most common data structure you will deal with in R. A data frame is created with the function data.frame(). The general format is:
mydata <- data.frame(col1, col2, col3,…)
where col1, col2, col3, … are column vectors of any type (such as character, numeric, or logical). Names for each column can be provided with the names() function. The following example will help you understand better.
Example
Let us create a data frame of students:
> rollno <- c(1201,1202,1203,1204,1205)
> age <- c(19,20,22,19,20)
> results <- c("Pass","Pass","Fail","Pass","Fail")
> gender <- c("Male","Female","Male","Male","Female")
> students <- data.frame(rollno,age,gender,results)
> students
  rollno age gender results
1   1201  19   Male    Pass
2   1202  20 Female    Pass
3   1203  22   Male    Fail
4   1204  19   Male    Pass
5   1205  20 Female    Fail
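The names() function mentioned earlier can both read and set column names. The following is a minimal sketch in base R. The replacement names (roll_number and so on) and the copy students2 are purely illustrative and are not part of the original example.

```r
# Rebuild the students data frame, naming the columns at creation time
students <- data.frame(
  rollno  = c(1201, 1202, 1203, 1204, 1205),
  age     = c(19, 20, 22, 19, 20),
  gender  = c("Male", "Female", "Male", "Male", "Female"),
  results = c("Pass", "Pass", "Fail", "Pass", "Fail")
)

# Read the current column names
names(students)

# Work on a copy so the original data frame is left unchanged
students2 <- students

# Assign new (hypothetical) column names, one per column, in order
names(students2) <- c("roll_number", "age_years", "gender", "result")
names(students2)

# str() prints one line per column with its type
str(students)
```

The exact str() output depends on the R version, because older versions convert character columns to factors by default.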
Points to remember: each column must have only one data type, but you can put columns of different data types together to form the data frame. Because data frames are close to what analysts typically think of as datasets, we’ll use the terms columns and variables interchangeably when discussing data frames.
There are several ways to identify the elements of a data frame. You can use the subscript notation or you can specify column names. The following illustration uses the students data frame from the example above.
Three ways to subset a data frame in R, with examples
In this example, let’s subset the rollno and results columns from the data frame above, and then extract the age column on its own.
> students[c(1,4)]
  rollno results
1   1201    Pass
2   1202    Pass
3   1203    Fail
4   1204    Pass
5   1205    Fail
> students[c("rollno","results")]
  rollno results
1   1201    Pass
2   1202    Pass
3   1203    Fail
4   1204    Pass
5   1205    Fail
> students$age
[1] 19 20 22 19 20
The $ notation in the third example is used to indicate a particular variable from a given data frame. For example, if you want to cross-tabulate students’ gender by results, you could use the following code:
> table(students$gender,students$results)
         Fail Pass
  Female    1    1
  Male      1    2
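As a quick check on the subsetting forms and the cross-tabulation above, the sketch below rebuilds the students data frame and inspects what each expression returns. It uses only base R functions. The variable names by_index, by_name, and tab are illustrative.

```r
# Rebuild the students data frame from the example above
students <- data.frame(
  rollno  = c(1201, 1202, 1203, 1204, 1205),
  age     = c(19, 20, 22, 19, 20),
  gender  = c("Male", "Female", "Male", "Male", "Female"),
  results = c("Pass", "Pass", "Fail", "Pass", "Fail")
)

by_index <- students[c(1, 4)]                 # columns selected by position
by_name  <- students[c("rollno", "results")]  # columns selected by name

identical(by_index, by_name)   # TRUE: both forms select the same columns
is.data.frame(by_index)        # TRUE: single brackets return a data frame
is.numeric(students$age)       # TRUE: $ returns the column itself as a vector

# Cross-tabulate gender by results and look up a single cell
tab <- table(students$gender, students$results)
tab["Male", "Pass"]            # 2
```

Selecting by name is usually the safer choice, since the code keeps working if the column order changes.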
You may get tired of typing students$ at the beginning of every variable name, so R has shortcuts for that too. You can use either the attach() and detach() functions or the with() function to simplify your code.
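A minimal sketch of both shortcuts, using the students data frame from the example above. The variable mean_age is illustrative. The data frame is rebuilt here with named arguments so that no separate age or gender vectors exist in the workspace.

```r
# Rebuild the students data frame from the example above
students <- data.frame(
  rollno  = c(1201, 1202, 1203, 1204, 1205),
  age     = c(19, 20, 22, 19, 20),
  gender  = c("Male", "Female", "Male", "Male", "Female"),
  results = c("Pass", "Pass", "Fail", "Pass", "Fail")
)

# with() evaluates an expression with the data frame's columns as variables
with(students, table(gender, results))
with(students, mean(age))   # 20

# attach() puts the columns on the search path until detach() is called
attach(students)
mean_age <- mean(age)       # refers to students$age
detach(students)
```

If vectors named age, gender, and so on already exist in the workspace, as they do after running the original example, attach() reports that the attached columns are masked by them, and the workspace vectors are the ones used. with() avoids this ambiguity, which makes it the more predictable choice in most code.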