dataset-summary.Rmd
The create_dataset_summary_table function helps data analysts quickly
assess the structure and content of their datasets. This vignette walks
through the function and its helper functions, which together generate a
per-column summary of any dataframe.
First, load the vvauditor package:
library(vvauditor)
The primary function, create_dataset_summary_table, generates a summary
statistics table for a given dataframe. Let’s use the built-in mtcars
dataset to see how it works:
summary_table <- create_dataset_summary_table(mtcars)
summary_table
#> # A tibble: 11 × 8
#> column_name column_type missing_value_percentage min_value max_value
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 mpg numeric 0 10.4 33.9
#> 2 cyl numeric 0 4 8
#> 3 disp numeric 0 71.1 472
#> 4 hp numeric 0 52 335
#> 5 drat numeric 0 2.76 4.93
#> 6 wt numeric 0 1.51 5.42
#> 7 qsec numeric 0 14.5 22.9
#> 8 vs numeric 0 0 1
#> 9 am numeric 0 0 1
#> 10 gear numeric 0 3 5
#> 11 carb numeric 0 1 8
#> # ℹ 3 more variables: unique_identifier <lgl>, distribution_statistics <chr>,
#> # distribution_percentages <chr>
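The resulting table has one row per column of the input dataframe. For each column it reports the data type, the missing value percentage, the minimum and maximum values, whether the column is a unique identifier, and its distribution statistics and distribution percentages.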
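To make these columns concrete, the following base-R sketch computes the first five of them by hand for mtcars. It is an illustration only, not the package's implementation; the helper name `sketch_summary` is hypothetical, and it types each column by the class of the whole vector.

```r
# Hypothetical helper: approximates a few columns of the summary table
# with base R only. This is NOT the vvauditor implementation.
sketch_summary <- function(df) {
  # Minimum or maximum of a numeric column, NA for anything else
  numeric_extreme <- function(x, fun) {
    if (is.numeric(x) && !all(is.na(x))) as.numeric(fun(x, na.rm = TRUE)) else NA_real_
  }
  data.frame(
    column_name = names(df),
    column_type = vapply(df, function(x) class(x)[1], character(1)),
    missing_value_percentage = vapply(df, function(x) 100 * mean(is.na(x)), numeric(1)),
    min_value = vapply(df, numeric_extreme, numeric(1), fun = min),
    max_value = vapply(df, numeric_extreme, numeric(1), fun = max),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

sketch <- sketch_summary(mtcars)
head(sketch, 3)
```

Comparing `sketch` with the first five columns of `summary_table` is a quick way to check your understanding of what each column means.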
The create_dataset_summary_table function relies on several helper functions to perform specific tasks:
- get_first_element_class: Determines the class of the first element in a vector.
- find_minimum_value: Locates the minimum numeric value in a vector.
- find_maximum_value: Finds the maximum numeric value in a vector.
- is_unique_column: Checks whether a column contains unique values.
- get_distribution_statistics: Computes distribution statistics for numeric vectors.
- calculate_category_percentages: Calculates the percentage of categories in a data vector.

Each of these functions plays a crucial role in constructing the final summary table. We will explore each one in more detail in the following sections.
The get_first_element_class function identifies the data type of a column by checking the class of its first element:
# Example usage
get_first_element_class(mtcars$mpg) # "numeric"
#> [1] "numeric"
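The idea of typing a column by its first element can be sketched in base R as follows. This reflects an assumption based on the function's description, not the package source; `first_class` is a hypothetical name, and keeping only the first class label is a choice made for this sketch.

```r
# Hypothetical stand-in: class of the first element of a vector.
first_class <- function(x) {
  # An empty vector has no first element to inspect
  if (length(x) == 0) return(NA_character_)
  # Objects such as date-times carry several class labels; keep the first
  class(x[[1]])[1]
}

first_class(mtcars$mpg)
first_class(factor(c("a", "b")))
first_class(Sys.time())
```

Looking at a single element is cheap, but it means a list column with mixed contents would be typed by whatever happens to come first.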
The find_minimum_value and find_maximum_value functions determine the range of values for numeric columns:
# Example usage
find_minimum_value(mtcars$hp) # Minimum horsepower
#> [1] 52
find_maximum_value(mtcars$disp) # Maximum displacement
#> [1] 472
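Real data often contains missing values, and the examples above do not show how the helpers treat them. As a minimal sketch of one possible approach, assuming missing values should be ignored and non-numeric input should yield NA, a base-R version might look like this (`safe_range` is a hypothetical name, not part of the package):

```r
# Hypothetical helper: minimum and maximum of a numeric vector,
# ignoring NAs and returning NA for non-numeric or all-missing input.
safe_range <- function(x) {
  if (!is.numeric(x) || all(is.na(x))) {
    return(c(min = NA_real_, max = NA_real_))
  }
  c(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE))
}

hp_range <- safe_range(c(mtcars$hp, NA))  # the trailing NA is ignored
hp_range
safe_range(letters)                       # non-numeric input gives NA
```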
The is_unique_column function checks whether a particular column contains only unique values:
# Example usage
is_unique_column("cyl", mtcars) # Are cylinder numbers unique?
#> [1] FALSE
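A uniqueness check of this kind can be approximated in base R with `anyDuplicated`. The sketch below is an assumption about the general technique, not the package's code; `has_unique_values` and the `cars` dataframe are hypothetical names introduced for illustration.

```r
# Hypothetical helper: TRUE when no value in the named column repeats.
has_unique_values <- function(column_name, data) {
  # anyDuplicated() returns 0 when there are no duplicates
  anyDuplicated(data[[column_name]]) == 0
}

# Turn the row names of mtcars into a regular column to get a unique key
cars <- data.frame(model = rownames(mtcars), cyl = mtcars$cyl)

has_unique_values("cyl", cars)    # many cars share a cylinder count
has_unique_values("model", cars)  # each model name appears once
```

A column that passes this check is a candidate identifier for the dataset, which is what the unique_identifier column of the summary table flags.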
The get_distribution_statistics function returns the quartiles, median, mean, and standard deviation of a numeric column as a single string:
# Example usage
get_distribution_statistics(mtcars$wt) # Distribution statistics for weight
#> [1] "Q1 = 2.58125 | median = 3.325 | Q3 = 3.61 | mean = 3.22 | stdev = 0.978"
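The same quantities can be computed directly with base R, which is useful when you need them as numbers rather than as a formatted string. This is a sketch under the assumption that the default quantile definition of `quantile()` is appropriate; `sketch_distribution` is a hypothetical name, and the rounding in the last line is a presentation choice for this example.

```r
# Hypothetical helper: quartiles, median, mean and standard deviation
# of a numeric vector, returned as a named numeric vector.
sketch_distribution <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(
    Q1 = q[1],
    median = q[2],
    Q3 = q[3],
    mean = mean(x, na.rm = TRUE),
    stdev = sd(x, na.rm = TRUE)
  )
}

stats <- sketch_distribution(mtcars$wt)

# Collapse into one line of "name = value" pairs separated by " | "
paste(names(stats), "=", round(stats, 3), collapse = " | ")
```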
Lastly, the calculate_category_percentages function reports the share of each category in a data vector as a percentage:
# Example usage
calculate_category_percentages(mtcars$gear) # Category percentages for gears
#> [1] "3: (46.88%) | 4: (37.50%) | 5: (15.62%)"
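A percentage breakdown like this can be built from `table()` and `sprintf()`. The following is a minimal sketch of that technique, not the package's implementation; `sketch_percentages` is a hypothetical name, and note that `table()` drops missing values by default.

```r
# Hypothetical helper: share of each category, formatted as one string.
sketch_percentages <- function(x) {
  counts <- table(x)                             # one count per distinct value
  pct <- as.numeric(100 * counts / sum(counts))  # convert counts to percentages
  paste(sprintf("%s: (%.2f%%)", names(counts), pct), collapse = " | ")
}

gear_share <- sketch_percentages(mtcars$gear)
gear_share
```

For columns with many distinct values the resulting string gets long, so a breakdown of this kind is most readable for low-cardinality columns such as gear or cyl.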
The create_dataset_summary_table function and its companion functions
offer a convenient way to inspect the structure and content of your
data. A single call summarizes the type, missing values, range,
uniqueness, and distribution of every column, which speeds up data
exploration and helps you make more informed decisions during your
analysis.