First steps

Installation

To install the latest stable version of Cleaner, run:

import Pkg
Pkg.add("Cleaner")

Once installation has finished, run using Cleaner to bring all of Cleaner's exported functions into your current namespace.

About the CleanTable type

A CleanTable represents tabular data and is column-based by design. It is also the type on which all Cleaner functions operate.

It implements the Tables.jl interface, and its constructor can create a CleanTable from any Tables.jl implementation.

julia> using DataFrames

julia> df = DataFrame(A = Any[1, 2, 3, 4], B = Any["M", "F", "F", "M"])
4×2 DataFrame
 Row │ A    B
     │ Any  Any
─────┼──────────
   1 │ 1    M
   2 │ 2    F
   3 │ 3    F
   4 │ 4    M

julia> using Cleaner

julia> ct = CleanTable(df)
┌─────┬─────┐
│   A │   B │
│ Any │ Any │
├─────┼─────┤
│   1 │   M │
│   2 │   F │
│   3 │   F │
│   4 │   M │
└─────┴─────┘
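
The constructor is not specific to DataFrames. As a minimal sketch, assuming a NamedTuple of vectors (which Tables.jl treats as a column table) is an acceptable source, the same step could look like this. The variable names are illustrative only.

```julia
using Cleaner

# A NamedTuple of equal-length vectors is a valid Tables.jl column table.
nt_source = (A = Any[1, 2, 3, 4], B = Any["M", "F", "F", "M"])

# Build a CleanTable from it, in the same way as with the DataFrame above.
nt_table = CleanTable(nt_source)

# Inspect the element type and contents of the resulting columns.
println(eltype(nt_table.A))
println(nt_table.B)
```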

If the Tables.jl implementation you started from (the source) can itself be constructed from any Tables.jl implementation, you can get back an object of the source type by calling its constructor on the CleanTable once you have finished working with Cleaner.

julia> reinfer_schema!(ct)
┌───────┬────────┐
│     A │      B │
│ Int64 │ String │
├───────┼────────┤
│     1 │      M │
│     2 │      F │
│     3 │      F │
│     4 │      M │
└───────┴────────┘


julia> DataFrame(ct)
4×2 DataFrame
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  M
   2 │     2  F
   3 │     3  F
   4 │     4  M
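
The same idea applies to other sinks. As a hedged sketch, Tables.columntable and Tables.rowtable from Tables.jl both accept any Tables.jl implementation, so they should be able to consume a CleanTable as well. Loading Tables directly is an assumption here: it has to be available in your environment.

```julia
using Cleaner
using Tables

# Start from an untyped column table and let Cleaner infer the column types.
typed = CleanTable((A = Any[1, 2, 3, 4], B = Any["M", "F", "F", "M"])) |> reinfer_schema!

# Materialize the result as a NamedTuple of vectors (column table)...
as_columns = Tables.columntable(typed)

# ...or as a Vector of NamedTuples (row table).
as_rows = Tables.rowtable(typed)

println(as_columns.A)
println(first(as_rows))
```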

All Cleaner functions also support piping, so the code above can be rewritten as:

julia> df |> CleanTable |> reinfer_schema! |> DataFrame
4×2 DataFrame
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  M
   2 │     2  F
   3 │     3  F
   4 │     4  M

By default, when called with a table as its only argument, the CleanTable constructor copies the columns instead of using the source columns directly. This behavior can be overridden by explicitly passing the copycols=false keyword argument.

julia> ct = CleanTable(df)
┌─────┬─────┐
│   A │   B │
│ Any │ Any │
├─────┼─────┤
│   1 │   M │
│   2 │   F │
│   3 │   F │
│   4 │   M │
└─────┴─────┘


julia> ct.A[1] = 5
5

julia> ct
┌─────┬─────┐
│   A │   B │
│ Any │ Any │
├─────┼─────┤
│   5 │   M │
│   2 │   F │
│   3 │   F │
│   4 │   M │
└─────┴─────┘


julia> df
4×2 DataFrame
 Row │ A    B
     │ Any  Any
─────┼──────────
   1 │ 1    M
   2 │ 2    F
   3 │ 3    F
   4 │ 4    M

julia> ct = CleanTable(df; copycols=false)
┌─────┬─────┐
│   A │   B │
│ Any │ Any │
├─────┼─────┤
│   1 │   M │
│   2 │   F │
│   3 │   F │
│   4 │   M │
└─────┴─────┘


julia> ct.A[1] = 5;

julia> ct
┌─────┬─────┐
│   A │   B │
│ Any │ Any │
├─────┼─────┤
│   5 │   M │
│   2 │   F │
│   3 │   F │
│   4 │   M │
└─────┴─────┘


julia> df
4×2 DataFrame
 Row │ A    B
     │ Any  Any
─────┼──────────
   1 │ 5    M
   2 │ 2    F
   3 │ 3    F
   4 │ 4    M

Accessing columns

To access a specific column, CleanTable supports access by column index and by column name.

julia> ct = CleanTable([:A, :B], [[1, 2, 3, 4], ["M", "F", "F", "M"]])
┌───────┬────────┐
│     A │      B │
│ Int64 │ String │
├───────┼────────┤
│     1 │      M │
│     2 │      F │
│     3 │      F │
│     4 │      M │
└───────┴────────┘


julia> ct.A
4-element Vector{Int64}:
 1
 2
 3
 4

julia> ct[1]
4-element Vector{Int64}:
 1
 2
 3
 4
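
When column names are only known at run time, the two access styles can be combined with the generic Tables.jl accessors. The following is a sketch rather than part of Cleaner's documented API: it assumes Tables is loaded alongside Cleaner and that the order reported by Tables.columnnames matches the positional index.

```julia
using Cleaner
using Tables

ct = CleanTable([:A, :B], [[1, 2, 3, 4], ["M", "F", "F", "M"]])

# Walk over the columns, reading each one by name and by position.
for (i, name) in enumerate(Tables.columnnames(Tables.columns(ct)))
    by_name = getproperty(ct, name)   # same as ct.A when name == :A
    by_index = ct[i]                  # positional access
    println(name, ": ", eltype(by_name), ", matches index ", i, ": ", by_name == by_index)
end
```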

Because accessing a column of a CleanTable returns the column itself rather than a copy, you can change its values by modifying the returned vector in place. You can also assign a new vector to a column.

For example:

julia> ct.A = [5, 6, 7, 8]
4-element Vector{Int64}:
 5
 6
 7
 8

julia> ct
┌───────┬────────┐
│     A │      B │
│ Int64 │ String │
├───────┼────────┤
│     5 │      M │
│     6 │      F │
│     7 │      F │
│     8 │      M │
└───────┴────────┘
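
Assigning a whole vector is one option. Because the accessed column is the stored vector itself, in-place updates should behave the same way. A small sketch, assuming the behavior described above (the values are made up for illustration):

```julia
using Cleaner

ct = CleanTable([:A, :B], [[1, 2, 3, 4], ["M", "F", "F", "M"]])

# Overwrite a single element through the accessed column.
ct.A[1] = 10

# Broadcasted assignment also writes into the existing column in place.
ct.A .*= 2

println(ct.A)
```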

Warning

Adding or removing rows or columns without using Cleaner.jl functions is not supported and is strongly discouraged. For those needs, please refer to other packages such as DataFrames.jl.
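
As one possible workflow, and only a sketch, structural changes can be delegated to DataFrames.jl and the result handed back to Cleaner. The column name C and its values are invented for illustration.

```julia
using Cleaner
using DataFrames

ct = CleanTable([:A, :B], [[1, 2, 3, 4], ["M", "F", "F", "M"]])

# Move to a DataFrame for the structural edits.
df = DataFrame(ct)
df.C = df.A .* 10            # add a column
select!(df, Not(:B))         # remove a column
push!(df, (A = 5, C = 50))   # add a row

# Hand the reshaped table back to Cleaner.
ct = CleanTable(df)
println(ct.A)
println(ct.C)
```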