The `data.table` package is one of the fastest packages for data manipulation; at the time of writing, it is even faster than `pandas` and `dplyr`^{1}. The `data.table` syntax is `dt[i, j, by]`, where:

- `i` is used to subset (or reorder) rows
- `j` is used to select or compute on columns
- `by` is used to group rows, like **GROUP BY** in SQL

You can read it *out loud* as^{2}:

Take `dt`, subset/reorder rows using `i`, then calculate `j`, grouped by `by`.
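As a minimal sketch of how the three parts fit together, the example below filters, aggregates, and groups in a single call. The table and its column names (`group`, `value`) are made up for illustration.

```r
library(data.table)

# Hypothetical toy data, made up for illustration
dt <- data.table(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 4, 5)
)

# i:  keep rows where value > 1
# j:  compute the mean of value
# by: do it once per group
result <- dt[value > 1, .(mean_value = mean(value)), by = group]
print(result)
```

Any of the three parts can be left out: `dt[value > 1]` only filters, and `dt[, mean(value)]` only computes.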

A `data.table` is also a `data.frame`, so all of the basic data manipulations you can use on `data.frame`s apply to `data.table`s as well, such as `ncol()`, `nrow()`, `names()`, and `summary()`. But it offers more possibilities. For instance, `data.table` has a special variable `.N`, an integer that holds the number of rows in the current group. Used in `i`, it refers to the total number of rows, so `dt[.N]` gives you the last row of your `data.table`.
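The two uses of `.N` can be sketched as follows, alongside a couple of the inherited `data.frame` helpers. The data is again a hypothetical toy table.

```r
library(data.table)

# Hypothetical toy data, made up for illustration
dt <- data.table(
  group = c("a", "a", "b", "b", "b"),
  value = c(10, 20, 30, 40, 50)
)

# In i, .N is the total number of rows, so this picks the last row
last_row <- dt[.N]

# In j combined with by, .N is the number of rows in each group
counts <- dt[, .(n = .N), by = group]

# data.frame helpers keep working because a data.table is a data.frame
n_rows <- nrow(dt)
col_names <- names(dt)
```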

Another cool feature of `data.table` is that if you want to filter/subset rows by a column, you don't need to write `df[df$x == 1, ]`; you can simply use `dt[x == 1]`, which makes your code much more readable and clean.
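For comparison, here is a small sketch of the same row filter written both ways. The columns `x` and `y` are invented for the example.

```r
library(data.table)

# Hypothetical toy data, made up for illustration
df <- data.frame(x = c(1, 2, 1, 3), y = c("a", "b", "c", "d"))
dt <- as.data.table(df)

# data.frame: the frame name is repeated and the trailing comma is required
from_df <- df[df$x == 1, ]

# data.table: column names are looked up inside the table itself
from_dt <- dt[x == 1]
```

Both calls return the same two rows; the `data.table` version just has less to type and less to get wrong.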

You also get to use special operators: `%like%`, `%in%`, and `%between%`. These operators work like the SQL operators **LIKE**, **IN**, and **BETWEEN**, respectively.
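A quick sketch of the three operators on a made-up table follows. One difference from SQL is worth noting: `%like%` takes a regular expression rather than `%` wildcards.

```r
library(data.table)

# Hypothetical toy data, made up for illustration
dt <- data.table(
  name = c("apple", "banana", "apricot", "cherry"),
  qty  = c(5, 12, 8, 20)
)

# %like% matches a regular expression (not SQL wildcards)
starts_with_ap <- dt[name %like% "^ap"]

# %in% tests membership in a set of values
picked <- dt[name %in% c("banana", "cherry")]

# %between% keeps values inside an interval, bounds included
mid_qty <- dt[qty %between% c(8, 12)]
```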

If you are familiar with SQL, there is one thing the package offers that will catch your eye. It's called *chaining*, and it lets you perform a sequence of operations on a `data.table`: you just append one `[]` after another, as in `dt[][]`, and each `[]` works on the result of the previous one.
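A chain might look like the sketch below, which aggregates, filters the aggregate, and then sorts it, roughly what `GROUP BY`, `HAVING`, and `ORDER BY` would do in SQL. The table and the threshold are invented for illustration.

```r
library(data.table)

# Hypothetical toy data, made up for illustration
dt <- data.table(
  group = c("a", "a", "b", "b", "c"),
  value = c(1, 2, 3, 4, 10)
)

# Step 1: total value per group
# Step 2: keep groups whose total exceeds 5
# Step 3: sort by total, largest first
result <- dt[, .(total = sum(value)), by = group][total > 5][order(-total)]
print(result)
```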

But that's not all: with the `:=` operator you can modify data by reference, without making a new copy in memory.
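The sketch below shows three common uses of `:=` on a made-up table: adding a column, updating selected rows, and dropping a column. Note that none of the calls assigns the result back with `<-`, since `dt` itself is changed.

```r
library(data.table)

# Hypothetical toy data, made up for illustration
dt <- data.table(x = 1:3, y = c(10, 20, 30))

# Add a column in place
dt[, z := x * y]

# Update only the rows selected by i
dt[x == 2, y := 0]

# Remove a column by assigning NULL
dt[, x := NULL]

print(dt)
```

Because the change happens in place, any other variable that points to the same table sees it too; use `copy(dt)` first when you need an independent copy.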

If you want to start using the package, I suggest you use the cheatsheet. It's really useful if you already have a basic knowledge of `data.frame`s.