R offers a wide range of structures for holding data, such as scalars, vectors, matrices, arrays, data frames, and lists. In Data structures in R – Part 1 we covered scalars, vectors, matrices, and arrays. Now let’s look at data frames and lists.
Data frames
A data frame is a table, or two-dimensional array-like structure, in which each column holds the values of one variable and each row holds one set of values, one from each column. Unlike a matrix, different columns can hold different modes of data (numeric, character, etc.). It is similar to the datasets used in IBM SPSS, SAS and Stata. Data frames are the most commonly used data structure in R.
Characteristics of a data frame
- Column names should not be empty.
- Row names should be unique.
- Data stored in a data frame can be of numeric, factor or character type.
- Each column should contain the same number of data items.
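The length and row-name rules above can be checked directly. The following is a minimal sketch; the `scores` data frame and its values are hypothetical and used only for illustration. Note that `data.frame()` recycles a shorter vector when its length divides the longer one evenly, so the sketch uses lengths 3 and 2, which cannot be recycled.

```r
# Hypothetical data frame used only for illustration.
scores <- data.frame(
  id = 1:3,
  grade = c("A", "B", "C")
)

# Columns of length 3 and 2 cannot be combined, so data.frame() raises an error.
# tryCatch() captures the error message instead of stopping the script.
length_error <- tryCatch(
  data.frame(id = 1:3, grade = c("A", "B")),
  error = function(e) conditionMessage(e)
)
print(length_error)

# Assigning duplicate row names raises an error as well.
dup_error <- tryCatch(
  {
    rownames(scores) <- c("r1", "r1", "r2")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(dup_error)
```

Both calls end in an error message rather than a data frame, which is how R enforces these two characteristics.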
Let’s look at an example:
    > marklist<-data.frame(
    +   rollno = c(1001:1006),
    +   name = c("Abdul","Balu","Charlie","Daniel","Elisa","Fathima"),
    +   marks = c(87,91,66,57,83,72)
    + )
    > marklist
      rollno    name marks
    1   1001   Abdul    87
    2   1002    Balu    91
    3   1003 Charlie    66
    4   1004  Daniel    57
    5   1005   Elisa    83
    6   1006 Fathima    72
In the above example, each column holds a single data type, but different columns of the data frame can hold different data types.
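One way to confirm the type of each column is to ask R for its class. This is a small sketch that rebuilds the marklist data frame so it runs on its own; `stringsAsFactors = FALSE` is set explicitly here as an assumption, so that the name column is stored as character regardless of the R version.

```r
# Rebuild the marklist data frame from the example above.
marklist <- data.frame(
  rollno = 1001:1006,
  name = c("Abdul", "Balu", "Charlie", "Daniel", "Elisa", "Fathima"),
  marks = c(87, 91, 66, 57, 83, 72),
  stringsAsFactors = FALSE  # keep names as character, not factor
)

# sapply() applies class() to every column and returns a named vector.
col_classes <- sapply(marklist, class)
print(col_classes)

# str() gives a compact overview of the structure: dimensions, types, first values.
str(marklist)
```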
We can subscript a data frame the same way we subscript a matrix. In addition, a single index selects whole columns, and the $ operator extracts one column by name. Let’s see this with the marklist dataset used above.
    > marklist[1,3]
    [1] 87
    > marklist[1:3]
      rollno    name marks
    1   1001   Abdul    87
    2   1002    Balu    91
    3   1003 Charlie    66
    4   1004  Daniel    57
    5   1005   Elisa    83
    6   1006 Fathima    72
    > marklist[c(1,3)]
      rollno marks
    1   1001    87
    2   1002    91
    3   1003    66
    4   1004    57
    5   1005    83
    6   1006    72
    > marklist[c("rollno","marks")]
      rollno marks
    1   1001    87
    2   1002    91
    3   1003    66
    4   1004    57
    5   1005    83
    6   1006    72
    > marklist$name
    [1] "Abdul"   "Balu"    "Charlie" "Daniel"
    [5] "Elisa"   "Fathima"
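The examples above select columns. Rows can be selected with the same matrix-style notation by leaving the column position empty. The sketch below rebuilds marklist so it runs on its own; the variable names (`first_two_rows`, `high_scorers`, and so on) and the threshold of 80 marks are chosen only for illustration.

```r
# Rebuild the marklist data frame from the example above.
marklist <- data.frame(
  rollno = 1001:1006,
  name = c("Abdul", "Balu", "Charlie", "Daniel", "Elisa", "Fathima"),
  marks = c(87, 91, 66, 57, 83, 72),
  stringsAsFactors = FALSE
)

# Rows 1 and 2, all columns (the empty column position means "every column").
first_two_rows <- marklist[1:2, ]
print(first_two_rows)

# Rows that satisfy a logical condition: students with more than 80 marks.
high_scorers <- marklist[marklist$marks > 80, ]
print(high_scorers)

# Single brackets keep the data frame structure; double brackets return the column itself.
name_df <- marklist["name"]
name_vec <- marklist[["name"]]
print(class(name_df))
print(class(name_vec))
```

Single brackets are handy when the result should stay a data frame, while `[[ ]]` and `$` are the usual choice when a plain vector is needed for further computation.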
Factors
Factors categorize data and store it as levels. They can store both strings and integers. They are well suited to columns that have a limited number of unique values, for example Male, Female, Neutral or True, False. They are useful in data analysis for statistical modelling. Factors are created with the factor() function, which takes a vector as input.
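As a brief illustration, the sketch below builds a factor from a character vector. The `gender` values are made up for the example, and the `levels` argument is passed explicitly as an assumption so that the level order does not depend on the default sorting.

```r
# Hypothetical character vector with a limited set of unique values.
gender <- c("Male", "Female", "Female", "Neutral", "Male")

# factor() takes the vector as input; levels fixes the set and order of categories.
gender_factor <- factor(gender, levels = c("Male", "Female", "Neutral"))

# The distinct categories are stored as levels.
print(levels(gender_factor))

# Internally each value is an integer code pointing at its level.
gender_codes <- as.integer(gender_factor)
print(gender_codes)

# table() counts how many observations fall into each level.
gender_counts <- table(gender_factor)
print(gender_counts)
```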
We will see more about factors in practice when we discuss statistical methods.