Exploring your table

Am I seeing double?

Tables can usualy have values in a column or columns that are supposed to be unique, but often are not. Primary keys from a table in a database are the most common example of this cases.

For when you want to find out what values (or combinations) are being duplicated on your table we have the get_all_repeated function.

julia> using DataFrames: DataFrame

julia> df = DataFrame(:A => ["y", "x", "y"], :B => ["x", "x", "x"]) 
3×2 DataFrame
 Row │ A       B
     │ String  String
─────┼────────────────
   1 │ y       x
   2 │ x       x
   3 │ y       x

julia> using Cleaner

julia> get_all_repeated(df, [:A])
┌───────────┬────────┐
│ row_index │      A │
│     Int64 │ String │
├───────────┼────────┤
│         1 │      y │
│         3 │      y │
└───────────┴────────┘


julia> get_all_repeated(df, [:A, :B])
┌───────────┬────────┬────────┐
│ row_index │      A │      B │
│     Int64 │ String │ String │
├───────────┼────────┼────────┤
│         1 │      y │      x │
│         3 │      y │      x │
└───────────┴────────┴────────┘

How much of each?

When you are working with categorical data, you might want to know what percentage of the total each category is representing.

For those cases we got the category_distribution function.

julia> category_distribution(df, [:A])
┌─────────────┬─────────┐
│       value │ percent │
│ Vector{Any} │ Float64 │
├─────────────┼─────────┤
│    Any["y"] │    66.7 │
│    Any["x"] │    33.3 │
└─────────────┴─────────┘

More than one column name can be passed in case your category is made of multiple columns.

julia> category_distribution(df, [:A, :B])
┌───────────────┬─────────┐
│         value │ percent │
│   Vector{Any} │ Float64 │
├───────────────┼─────────┤
│ Any["y", "x"] │    66.7 │
│ Any["x", "x"] │    33.3 │
└───────────────┴─────────┘

Shouldn't this match?

When working with multiple tables you might try to do joins and have them fail because there were different column names or schemas between them.

To help you identify these problems we got the compare_table_columns function.

julia> df = DataFrame(:A => ["y", "x", "y"], :B => ["x", "x", "x"])
3×2 DataFrame
 Row │ A       B
     │ String  String
─────┼────────────────
   1 │ y       x
   2 │ x       x
   3 │ y       x

julia> df2 = DataFrame(:A => ["y", "x", "y"], :B => [1, 2, 3], :C => [4.0, 5.0, 6.0])
3×3 DataFrame
 Row │ A       B      C
     │ String  Int64  Float64
─────┼────────────────────────
   1 │ y           1      4.0
   2 │ x           2      5.0
   3 │ y           3      6.0

julia> compare_table_columns(df, df2)
┌─────────────┬─────────┬─────────┐
│ column_name │    tbl1 │    tbl2 │
│      Symbol │    Type │    Type │
├─────────────┼─────────┼─────────┤
│           A │  String │  String │
│           B │  String │   Int64 │
│           C │ Nothing │ Float64 │
└─────────────┴─────────┴─────────┘

You can pass any number of tables to compare and its number of rows can be different between each and the other.

julia> df3 = DataFrame(:D => [:x])
1×1 DataFrame
 Row │ D
     │ Symbol
─────┼────────
   1 │ x

julia> compare_table_columns(df, df2, df3)
┌─────────────┬─────────┬─────────┬─────────┐
│ column_name │    tbl1 │    tbl2 │    tbl3 │
│      Symbol │    Type │    Type │    Type │
├─────────────┼─────────┼─────────┼─────────┤
│           A │  String │  String │ Nothing │
│           B │  String │   Int64 │ Nothing │
│           C │ Nothing │ Float64 │ Nothing │
│           D │ Nothing │ Nothing │  Symbol │
└─────────────┴─────────┴─────────┴─────────┘