Overview

The create_dataset_summary_table function gives data analysts a quick overview of the structure and content of a dataset. This vignette shows how to use the function, together with its helper functions, to generate a summary of any dataframe.

Getting Started

First, load the vvauditor package:

library(vvauditor)

Creating The Summary Table

The primary function, create_dataset_summary_table, generates a summary statistics table for a given dataframe. Let’s use the built-in mtcars dataset to see how it works:

summary_table <- create_dataset_summary_table(mtcars)
summary_table
#> # A tibble: 11 × 8
#>    column_name column_type missing_value_percentage min_value max_value
#>    <chr>       <chr>                          <dbl>     <dbl>     <dbl>
#>  1 mpg         numeric                            0     10.4      33.9 
#>  2 cyl         numeric                            0      4         8   
#>  3 disp        numeric                            0     71.1     472   
#>  4 hp          numeric                            0     52       335   
#>  5 drat        numeric                            0      2.76      4.93
#>  6 wt          numeric                            0      1.51      5.42
#>  7 qsec        numeric                            0     14.5      22.9 
#>  8 vs          numeric                            0      0         1   
#>  9 am          numeric                            0      0         1   
#> 10 gear        numeric                            0      3         5   
#> 11 carb        numeric                            0      1         8   
#> # ℹ 3 more variables: unique_identifier <lgl>, distribution_statistics <chr>,
#> #   distribution_percentages <chr>

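The resulting table has one row per column of the input dataframe. For each column it reports the name, the data type, the percentage of missing values, the minimum and maximum value, whether the column is a unique identifier, and two text columns holding distribution statistics and category percentages. The last three columns are hidden in the printed output above because of the console width.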
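
To make these columns concrete, the following base-R sketch recomputes a few of them by hand for mtcars. It is not the vvauditor implementation; the name manual_summary and the choice of base functions are assumptions made for illustration only.

```r
# Base-R sketch, NOT the vvauditor implementation: rebuild a few of the
# summary columns by hand to show what each one represents.
df <- mtcars

manual_summary <- data.frame(
  column_name = names(df),
  # first class label of each column
  column_type = vapply(df, function(x) class(x)[1], character(1)),
  # share of NA values, expressed as a percentage
  missing_value_percentage = vapply(df, function(x) mean(is.na(x)) * 100, numeric(1)),
  # range of each (numeric) column, ignoring missing values
  min_value = vapply(df, min, numeric(1), na.rm = TRUE),
  max_value = vapply(df, max, numeric(1), na.rm = TRUE),
  row.names = NULL
)

manual_summary
```

Comparing this hand-built table with the output of create_dataset_summary_table is a quick sanity check when you apply the function to your own data.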

Helper Functions

The create_dataset_summary_table function relies on several helper functions to perform specific tasks:

  • get_first_element_class: Determines the class of the first element in a vector.
  • find_minimum_value: Locates the minimum numeric value in a vector.
  • find_maximum_value: Finds the maximum numeric value in a vector.
  • is_unique_column: Checks whether a column contains unique values.
  • get_distribution_statistics: Computes distribution statistics for numeric vectors.
  • calculate_category_percentages: Calculates the percentage of categories in a data vector.

Each of these functions contributes to the final summary table. The following sections show each one on its own.

get_first_element_class

This function helps identify the data type of each column in the dataframe:

# Example usage

get_first_element_class(mtcars$mpg) # "numeric"
#> [1] "numeric"
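
Some R objects carry more than one class label, which is one reason a summary table may want a single label per column. The base-R sketch below illustrates that situation with a date-time vector. It assumes that taking the first label from class() is a reasonable stand-in; how get_first_element_class handles such objects is not shown in this vignette.

```r
# Assumption: base R's class() is used here for illustration only;
# the vvauditor helper may behave differently.
timestamps <- as.POSIXct(
  c("2024-01-01 12:00:00", "2024-06-01 12:00:00"),
  tz = "UTC"
)

class(timestamps)     # two labels: "POSIXct" "POSIXt"

# Keeping only the first label yields a single value per column,
# which fits in one cell of a summary table.
first_class <- class(timestamps)[1]
first_class           # "POSIXct"
```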

find_minimum_value and find_maximum_value

These functions assist in determining the range of values for numeric columns:

# Example usage

find_minimum_value(mtcars$hp) # Minimum horsepower 
#> [1] 52
find_maximum_value(mtcars$disp) # Maximum displacement
#> [1] 472
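
For comparison, the same values can be obtained with base R. The sketch below also shows why missing values matter when computing a range. Whether find_minimum_value and find_maximum_value drop missing values is not stated here, so treat the na.rm handling as an assumption about what you may need, not as a description of the helpers.

```r
# Base-R comparison; hp_with_gap is a hypothetical vector with one missing value.
hp_with_gap <- c(mtcars$hp, NA)

min(hp_with_gap)                 # NA, because one value is missing
min(hp_with_gap, na.rm = TRUE)   # 52, missing value ignored

disp_range <- range(mtcars$disp) # minimum and maximum in one call
disp_range                       # 71.1 472.0
```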

is_unique_column

With this function, you can easily check for uniqueness in a particular column:

# Example usage

is_unique_column("cyl", mtcars) # Are cylinder numbers unique?
#> [1] FALSE
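
A uniqueness check of this kind can be sketched in base R with anyDuplicated. The function name has_no_duplicates is hypothetical and is not part of vvauditor; it only illustrates the idea behind the check.

```r
# Hypothetical helper (not part of vvauditor): TRUE when no value repeats.
has_no_duplicates <- function(values) {
  # anyDuplicated() returns 0 when there are no duplicates,
  # otherwise the index of the first duplicated element.
  anyDuplicated(values) == 0L
}

has_no_duplicates(mtcars$cyl)        # FALSE: only 4, 6 and 8 occur
has_no_duplicates(rownames(mtcars))  # TRUE: each car model appears once
```

A column for which the check returns TRUE is a candidate key for joins; a FALSE result on an expected identifier usually points to duplicated records.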

get_distribution_statistics

This function provides descriptive statistics for numeric columns:

# Example usage

get_distribution_statistics(mtcars$wt) # Distribution statistics for weight
#> [1] "Q1 = 2.58125  | median = 3.325  | Q3 =  3.61  | mean = 3.22  | stdev = 0.978"
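
The individual numbers in that string can be reproduced with base R, which is useful when you need them as numeric values instead of text. This is a sketch using quantile with its default settings; it does not attempt to reproduce the helper's string formatting.

```r
# Base-R sketch: the same statistics as numeric values.
wt <- mtcars$wt

quartiles <- quantile(wt, probs = c(0.25, 0.5, 0.75), names = FALSE)

wt_stats <- c(
  Q1     = quartiles[1],  # 2.58125
  median = quartiles[2],  # 3.325
  Q3     = quartiles[3],  # 3.61
  mean   = mean(wt),      # about 3.22
  stdev  = sd(wt)         # about 0.978
)

wt_stats
```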

calculate_category_percentages

Lastly, this function calculates the percentage breakdown of categories in a data vector:

# Example usage

calculate_category_percentages(mtcars$gear) # Category percentages for gears
#> [1] "3: (46.88%) | 4: (37.50%) | 5: (15.62%)"
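
The percentages come from the category counts: mtcars has 32 rows, of which 15 have 3 gears, 12 have 4 gears, and 5 have 5 gears. The base-R sketch below computes the same shares with table and prop.table. The formatting step at the end is an assumption about how such a string could be assembled, not the helper's actual code.

```r
# Base-R sketch: category shares as numeric percentages.
gear_counts <- table(mtcars$gear)                 # 3: 15, 4: 12, 5: 5
gear_percentages <- 100 * prop.table(gear_counts) # 46.875, 37.5, 15.625

gear_percentages

# One possible way to collapse the result into a single string
# (illustrative only; rounding of ties can vary by platform).
gear_label <- paste0(
  names(gear_percentages), ": (",
  sprintf("%.2f", as.numeric(gear_percentages)), "%)",
  collapse = " | "
)
gear_label
```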

Conclusion

The create_dataset_summary_table function and its helper functions give a quick view of the structure and content of your data. Using them early in a project speeds up data exploration and helps you spot missing values, unexpected ranges, and non-unique columns before the analysis begins.