Let’s use tidypolars more

Published

June 26, 2022

Let’s use tidypolars more

In this blogpost I’m going to show how to perform the same task with pandas, the most popular library for data analysis in Python, and tidypolars, a new library to do data analysis with tabular data inspired on the tidyverse.

The task consists of computing a variable transformation that relies on grouped aggregations. In particular, we will be computing the standardized version of a numeric variable by group.

import palmerpenguins

import tidypolars as tp

from tidypolars import col

I’m going to work with the famous palmer penguins dataset. In Python this can be loaded very easily thanks to the palmerpenguins library.

data = palmerpenguins.load_penguins()
data
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

344 rows × 8 columns

Data wrangling with pandas

Before seeing an example using tidypolars I’m going to perform some basic data wrangling with pandas. One of the first things one usually do with a data frame is exploring its first rows visually. The .head() method comes very handy.

data.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007

I’m going to work with the species, sex, and body_mass_g columns only. To subset the dataframe we only need to pass the name of these columns as a list within brackets.

data = data[["species", "sex", "body_mass_g"]].reset_index(drop=True)
data.head()
species sex body_mass_g
0 Adelie male 3750.0
1 Adelie female 3800.0
2 Adelie female 3250.0
3 Adelie NaN NaN
4 Adelie female 3450.0

I add the .reset_index(drop=True) to avoid some SettingWithCopyWarnings later. I also want to drop any observations with missing values so I use .dropna() to do that.

data = data.dropna()

Suppose now I want to standardize the variable body_mass_g. Pandas vectorized operations make it extremely easy. One can save the result in a new column in a very intuitive way as well.

data["body_mass_z"] = (data["body_mass_g"] - data["body_mass_g"].mean()) / data["body_mass_g"].std()
data.head()
species sex body_mass_g body_mass_z
0 Adelie male 3750.0 -0.567621
1 Adelie female 3800.0 -0.505525
2 Adelie female 3250.0 -1.188572
4 Adelie female 3450.0 -0.940192
5 Adelie male 3650.0 -0.691811

We could consider it is more appropiate to standardize considering the species variable. We still perform the same operation than above, but we do it within each group.

The way to perform such operations in pandas is to use the .groupby() method. Then one can select the desired column and compute the aggregation.

For example, to compute the mean body mass by species we can do the following.

data.groupby("species")["body_mass_g"].mean()
species
Adelie       3706.164384
Chinstrap    3733.088235
Gentoo       5092.436975
Name: body_mass_g, dtype: float64

We obtained a pandas Series with three values, the mean for each species. If we want to obtain a Series of the same length than the original data, it is, for each row the mean of the species the observation belongs to, we can use .transform().

data.groupby("species")["body_mass_g"].transform("mean")
0      3706.164384
1      3706.164384
2      3706.164384
4      3706.164384
5      3706.164384
          ...     
339    3733.088235
340    3733.088235
341    3733.088235
342    3733.088235
343    3733.088235
Name: body_mass_g, Length: 333, dtype: float64

The same can be done with other transformations such as the standard deviation.

data.groupby("species")["body_mass_g"].transform("std")
0      458.620135
1      458.620135
2      458.620135
4      458.620135
5      458.620135
          ...    
339    384.335081
340    384.335081
341    384.335081
342    384.335081
343    384.335081
Name: body_mass_g, Length: 333, dtype: float64

Now, putting all the pieces together, we can compute the standardized body mass by species.

data["body_mass_z"] = (
    (data["body_mass_g"] - data.groupby("species")["body_mass_g"].transform("mean")) 
    / data.groupby("species")["body_mass_g"].transform("std")
)
data.head()
species sex body_mass_g body_mass_z
0 Adelie male 3750.0 0.095582
1 Adelie female 3800.0 0.204604
2 Adelie female 3250.0 -0.994645
4 Adelie female 3450.0 -0.558555
5 Adelie male 3650.0 -0.122464

It’s also possible to add more variables to the groups. For example, this is how we can perform the same compution considering groups given by species and sex.

data["body_mass_z"] = (
    (data["body_mass_g"] - data.groupby(["species", "sex"])["body_mass_g"].transform("mean")) 
    / data.groupby(["species", "sex"])["body_mass_g"].transform("std")
)
data.head()
species sex body_mass_g body_mass_z
0 Adelie male 3750.0 -0.846261
1 Adelie female 3800.0 1.600580
2 Adelie female 3250.0 -0.441145
4 Adelie female 3450.0 0.301301
5 Adelie male 3650.0 -1.134602

And finally, I can sort by the standardized body mass in ascending order.

data.sort_values("body_mass_z")
species sex body_mass_g body_mass_z
314 Chinstrap female 2700.0 -2.899080
192 Gentoo female 3950.0 -2.591611
195 Gentoo male 4750.0 -2.346530
298 Chinstrap female 2900.0 -2.198147
119 Adelie male 3325.0 -2.071711
... ... ... ... ...
114 Adelie female 3900.0 1.971803
109 Adelie male 4775.0 2.109234
284 Chinstrap female 4150.0 2.182685
313 Chinstrap male 4800.0 2.377631
169 Gentoo male 6300.0 2.603039

333 rows × 4 columns

Data wrangling with tidypolars

Now it’s time to see the same operations performed with tidypolars. Tidypolars is inspired on the tidyverse, a set of packages following a consistent design phillosophy that has revolutionated the way we do data science in R and other languages as well.

Its description says > tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users.

Tidypolars does not rely on any pandas data structure because it’s built on top of polars, not pandas. In addition, it works with a new data frame structure called Tibble, borrowing its name from the tibble in the R package of the same name.

We can convert a pandas DataFrame to a tibble with the .from_pandas() function.

tibble = tp.from_pandas(palmerpenguins.load_penguins())
type(tibble)
tidypolars.tibble.Tibble

We still have a .head() method that prints the first rows. The representation is very similar to a pandas data frame.

tibble.head()
shape: (5, 8)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
str str f64 f64 f64 f64 str i64
"Adelie" "Torgersen" 39.1 18.7 181.0 3750.0 "male" 2007
"Adelie" "Torgersen" 39.5 17.4 186.0 3800.0 "female" 2007
"Adelie" "Torgersen" 40.3 18.0 195.0 3250.0 "female" 2007
"Adelie" "Torgersen" null null null null null 2007
"Adelie" "Torgersen" 36.7 19.3 193.0 3450.0 "female" 2007

To filter rows we can use the .filter() method. In the next chunk of code we use col("*") to select all columns, tp.is_not_null() to flag observations with non-null values, and .filter() to use those booleans to actually perform the filtering.

Update: See this issue where the main author of tidypolars lets me know that we can use the .drop_null() method instead.

tibble.drop_null()
shape: (333, 8)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
str str f64 f64 f64 f64 str i64
"Adelie" "Torgersen" 39.1 18.7 181.0 3750.0 "male" 2007
"Adelie" "Torgersen" 39.5 17.4 186.0 3800.0 "female" 2007
"Adelie" "Torgersen" 40.3 18.0 195.0 3250.0 "female" 2007
"Adelie" "Torgersen" 36.7 19.3 193.0 3450.0 "female" 2007
"Adelie" "Torgersen" 39.3 20.6 190.0 3650.0 "male" 2007
"Adelie" "Torgersen" 38.9 17.8 181.0 3625.0 "female" 2007
"Adelie" "Torgersen" 39.2 19.6 195.0 4675.0 "male" 2007
"Adelie" "Torgersen" 41.1 17.6 182.0 3200.0 "female" 2007
"Adelie" "Torgersen" 38.6 21.2 191.0 3800.0 "male" 2007
"Adelie" "Torgersen" 34.6 21.1 198.0 4400.0 "male" 2007
"Adelie" "Torgersen" 36.6 17.8 185.0 3700.0 "female" 2007
"Adelie" "Torgersen" 38.7 19.0 195.0 3450.0 "female" 2007
... ... ... ... ... ... ... ...
"Chinstrap" "Dream" 45.2 16.6 191.0 3250.0 "female" 2009
"Chinstrap" "Dream" 49.3 19.9 203.0 4050.0 "male" 2009
"Chinstrap" "Dream" 50.2 18.8 202.0 3800.0 "male" 2009
"Chinstrap" "Dream" 45.6 19.4 194.0 3525.0 "female" 2009
"Chinstrap" "Dream" 51.9 19.5 206.0 3950.0 "male" 2009
"Chinstrap" "Dream" 46.8 16.5 189.0 3650.0 "female" 2009
"Chinstrap" "Dream" 45.7 17.0 195.0 3650.0 "female" 2009
"Chinstrap" "Dream" 55.8 19.8 207.0 4000.0 "male" 2009
"Chinstrap" "Dream" 43.5 18.1 202.0 3400.0 "female" 2009
"Chinstrap" "Dream" 49.6 18.2 193.0 3775.0 "male" 2009
"Chinstrap" "Dream" 50.8 19.0 210.0 4100.0 "male" 2009
"Chinstrap" "Dream" 50.2 18.7 198.0 3775.0 "female" 2009

It’s very easy to select columns with the .select() method.

tibble.select("species", "sex", "body_mass_g")
shape: (344, 3)
species sex body_mass_g
str str f64
"Adelie" "male" 3750.0
"Adelie" "female" 3800.0
"Adelie" "female" 3250.0
"Adelie" null null
"Adelie" "female" 3450.0
"Adelie" "male" 3650.0
"Adelie" "female" 3625.0
"Adelie" "male" 4675.0
"Adelie" null 3475.0
"Adelie" null 4250.0
"Adelie" null 3300.0
"Adelie" null 3700.0
... ... ...
"Chinstrap" "female" 3250.0
"Chinstrap" "male" 4050.0
"Chinstrap" "male" 3800.0
"Chinstrap" "female" 3525.0
"Chinstrap" "male" 3950.0
"Chinstrap" "female" 3650.0
"Chinstrap" "female" 3650.0
"Chinstrap" "male" 4000.0
"Chinstrap" "female" 3400.0
"Chinstrap" "male" 3775.0
"Chinstrap" "male" 4100.0
"Chinstrap" "female" 3775.0

What’s even better, we can chain these operations. This is where one can start seeing how powerful this approach is.

(
    tibble
    .drop_null()
    .select("species", "sex", "body_mass_g")
)
shape: (333, 3)
species sex body_mass_g
str str f64
"Adelie" "male" 3750.0
"Adelie" "female" 3800.0
"Adelie" "female" 3250.0
"Adelie" "female" 3450.0
"Adelie" "male" 3650.0
"Adelie" "female" 3625.0
"Adelie" "male" 4675.0
"Adelie" "female" 3200.0
"Adelie" "male" 3800.0
"Adelie" "male" 4400.0
"Adelie" "female" 3700.0
"Adelie" "female" 3450.0
... ... ...
"Chinstrap" "female" 3250.0
"Chinstrap" "male" 4050.0
"Chinstrap" "male" 3800.0
"Chinstrap" "female" 3525.0
"Chinstrap" "male" 3950.0
"Chinstrap" "female" 3650.0
"Chinstrap" "female" 3650.0
"Chinstrap" "male" 4000.0
"Chinstrap" "female" 3400.0
"Chinstrap" "male" 3775.0
"Chinstrap" "male" 4100.0
"Chinstrap" "female" 3775.0

We then have the .summarise() method to compute summaries (or aggregations) by groups. Note we use functions available in the tidypolars namespace such as tp.mean().

(
    tibble
    .drop_null()
    .select("species", "sex", "body_mass_g")
    .summarise(
        body_mass_mean=tp.mean("body_mass_g"),
        by="species"
    )
)
shape: (3, 2)
species body_mass_mean
str f64
"Adelie" 3706.164384
"Gentoo" 5092.436975
"Chinstrap" 3733.088235

If we want to get a behavior similar to .groupby() and .transform() in pandas, we can use another verb, mutate().

(
    tibble
    .drop_null()
    .select("species", "sex", "body_mass_g")
    .mutate(
        body_mass_mean=tp.mean("body_mass_g"),
        by="species"
    )
)
shape: (333, 4)
species sex body_mass_g body_mass_mean
str str f64 f64
"Gentoo" "female" 4500.0 5092.436975
"Gentoo" "male" 5700.0 5092.436975
"Gentoo" "female" 4450.0 5092.436975
"Gentoo" "male" 5700.0 5092.436975
"Gentoo" "male" 5400.0 5092.436975
"Gentoo" "female" 4550.0 5092.436975
"Gentoo" "female" 4800.0 5092.436975
"Gentoo" "male" 5200.0 5092.436975
"Gentoo" "female" 4400.0 5092.436975
"Gentoo" "male" 5150.0 5092.436975
"Gentoo" "female" 4650.0 5092.436975
"Gentoo" "male" 5550.0 5092.436975
... ... ... ...
"Adelie" "female" 3400.0 3706.164384
"Adelie" "male" 3475.0 3706.164384
"Adelie" "female" 3050.0 3706.164384
"Adelie" "male" 3725.0 3706.164384
"Adelie" "female" 3000.0 3706.164384
"Adelie" "male" 3650.0 3706.164384
"Adelie" "male" 4250.0 3706.164384
"Adelie" "female" 3475.0 3706.164384
"Adelie" "female" 3450.0 3706.164384
"Adelie" "male" 3750.0 3706.164384
"Adelie" "female" 3700.0 3706.164384
"Adelie" "male" 4000.0 3706.164384

This verb is very powerful. We can compute more complex expressions, such as the one involved in the standardization of a variable.

(
    tibble
    .drop_null()
    .select("species", "sex", "body_mass_g")
    .mutate(
        body_mass_z=(col("body_mass_g") - tp.mean("body_mass_g")) / tp.sd("body_mass_g"),
        by="species"
    )
)
shape: (333, 4)
species sex body_mass_g body_mass_z
str str f64 f64
"Gentoo" "female" 4500.0 -1.181386
"Gentoo" "male" 5700.0 1.211549
"Gentoo" "female" 4450.0 -1.281092
"Gentoo" "male" 5700.0 1.211549
"Gentoo" "male" 5400.0 0.613315
"Gentoo" "female" 4550.0 -1.0816
"Gentoo" "female" 4800.0 -0.583152
"Gentoo" "male" 5200.0 0.214493
"Gentoo" "female" 4400.0 -1.380797
"Gentoo" "male" 5150.0 0.114787
"Gentoo" "female" 4650.0 -0.882269
"Gentoo" "male" 5550.0 0.912432
... ... ... ...
"Adelie" "female" 3400.0 -0.667577
"Adelie" "male" 3475.0 -0.504043
"Adelie" "female" 3050.0 -1.430736
"Adelie" "male" 3725.0 0.041
"Adelie" "female" 3000.0 -1.539759
"Adelie" "male" 3650.0 -0.122464
"Adelie" "male" 4250.0 1.185808
"Adelie" "female" 3475.0 -0.504043
"Adelie" "female" 3450.0 -0.558555
"Adelie" "male" 3750.0 0.095582
"Adelie" "female" 3700.0 -0.013441
"Adelie" "male" 4000.0 0.640695

And finally, we can use the .arrange() verb to sort observations by the standardized body mass in ascending order.

(
    tibble
    .drop_null()
    .select("species", "sex", "body_mass_g")
    .mutate(
        body_mass_z=(col("body_mass_g") - tp.mean("body_mass_g")) / tp.sd("body_mass_g"),
        by="species"
    )
    .arrange("body_mass_z")
)
shape: (333, 4)
species sex body_mass_g body_mass_z
str str f64 f64
"Chinstrap" "female" 2700.0 -2.687988
"Gentoo" "female" 3950.0 -2.278148
"Chinstrap" "female" 2900.0 -2.167609
"Gentoo" "female" 4150.0 -1.879326
"Adelie" "female" 2850.0 -1.866827
"Adelie" "female" 2850.0 -1.866827
"Gentoo" "female" 4200.0 -1.7796
"Gentoo" "female" 4200.0 -1.7796
"Gentoo" "female" 4200.0 -1.7796
"Adelie" "female" 2900.0 -1.757804
"Adelie" "female" 2900.0 -1.757804
"Adelie" "female" 2900.0 -1.757804
... ... ... ...
"Gentoo" "male" 6050.0 1.909489
"Adelie" "male" 4600.0 1.948967
"Adelie" "male" 4600.0 1.948967
"Chinstrap" "male" 4500.0 1.995425
"Adelie" "male" 4650.0 2.0579
"Adelie" "male" 4675.0 2.112501
"Chinstrap" "male" 4550.0 2.1255
"Adelie" "male" 4700.0 2.167013
"Adelie" "male" 4725.0 2.221524
"Adelie" "male" 4775.0 2.330547
"Gentoo" "male" 6300.0 2.408017
"Chinstrap" "male" 4800.0 2.775994

We can make it even clearer if we wrap the standardization operation within a function.

def standardize(name):
    return (col(name) - tp.mean(name)) / tp.sd(name)

(
    tibble
    .drop_null()
    .select("species", "sex", "body_mass_g")
    .mutate(
        body_mass_z=standardize("body_mass_g"),
        by="species"
    )
    .arrange("body_mass_z")
)
shape: (333, 4)
species sex body_mass_g body_mass_z
str str f64 f64
"Chinstrap" "female" 2700.0 -2.687988
"Gentoo" "female" 3950.0 -2.278148
"Chinstrap" "female" 2900.0 -2.167609
"Gentoo" "female" 4150.0 -1.879326
"Adelie" "female" 2850.0 -1.866827
"Adelie" "female" 2850.0 -1.866827
"Gentoo" "female" 4200.0 -1.7796
"Gentoo" "female" 4200.0 -1.7796
"Gentoo" "female" 4200.0 -1.7796
"Adelie" "female" 2900.0 -1.757804
"Adelie" "female" 2900.0 -1.757804
"Adelie" "female" 2900.0 -1.757804
... ... ... ...
"Gentoo" "male" 6050.0 1.909489
"Adelie" "male" 4600.0 1.948967
"Adelie" "male" 4600.0 1.948967
"Chinstrap" "male" 4500.0 1.995425
"Adelie" "male" 4650.0 2.0579
"Adelie" "male" 4675.0 2.112501
"Chinstrap" "male" 4550.0 2.1255
"Adelie" "male" 4700.0 2.167013
"Adelie" "male" 4725.0 2.221524
"Adelie" "male" 4775.0 2.330547
"Gentoo" "male" 6300.0 2.408017
"Chinstrap" "male" 4800.0 2.775994

If we want to carry on the .mutate() operation grouped by more than one variable, we can simply pass the names in a list.

Comparison

Let’s see both approaches in action so we can better appreciate the differences.

With pandas

data = palmerpenguins.load_penguins()
data = data[["species", "sex", "body_mass_g"]].dropna().reset_index(drop=True)
data["body_mass_z"] = (
    (data["body_mass_g"] - data.groupby(["species", "sex"])["body_mass_g"].transform("mean")) 
    / data.groupby(["species", "sex"])["body_mass_g"].transform("std")
)
data.sort_values("body_mass_z")
species sex body_mass_g body_mass_z
303 Chinstrap female 2700.0 -2.899080
185 Gentoo female 3950.0 -2.591611
188 Gentoo male 4750.0 -2.346530
287 Chinstrap female 2900.0 -2.198147
113 Adelie male 3325.0 -2.071711
... ... ... ... ...
108 Adelie female 3900.0 1.971803
103 Adelie male 4775.0 2.109234
273 Chinstrap female 4150.0 2.182685
302 Chinstrap male 4800.0 2.377631
163 Gentoo male 6300.0 2.603039

333 rows × 4 columns

With tidypolars

def standardize(name):
    return (col(name) - tp.mean(name)) / tp.sd(name)
tibble = tp.from_pandas(palmerpenguins.load_penguins())
(
    tibble
    .drop_null()
    .select("species", "sex", "body_mass_g")
    .mutate(
        body_mass_z=standardize("body_mass_g"),
        by=["species", "sex"]
    )
    .arrange("body_mass_z")
)
shape: (333, 4)
species sex body_mass_g body_mass_z
str str f64 f64
"Chinstrap" "female" 2700.0 -2.899
"Gentoo" "female" 3950.0 -2.591611
"Gentoo" "male" 4750.0 -2.3465
"Chinstrap" "female" 2900.0 -2.198147
"Adelie" "male" 3325.0 -2.071711
"Adelie" "female" 2850.0 -1.926035
"Adelie" "female" 2850.0 -1.926035
"Chinstrap" "male" 3250.0 -1.902511
"Gentoo" "female" 4150.0 -1.881329
"Gentoo" "male" 4925.0 -1.787708
"Adelie" "male" 3425.0 -1.7833
"Chinstrap" "male" 3300.0 -1.764442
... ... ... ...
"Adelie" "male" 4650.0 1.748808
"Gentoo" "male" 6050.0 1.804721
"Adelie" "male" 4675.0 1.820893
"Gentoo" "female" 5200.0 1.847652
"Gentoo" "female" 5200.0 1.847652
"Adelie" "male" 4700.0 1.892979
"Adelie" "male" 4725.0 1.965064
"Adelie" "female" 3900.0 1.971803
"Adelie" "male" 4775.0 2.109234
"Chinstrap" "female" 4150.0 2.182685
"Chinstrap" "male" 4800.0 2.377631
"Gentoo" "male" 6300.0 2.603039

Conclusion

Pandas is great, so many people love it, and it will be the most used tool to work with tabular data in Python for a long time. In my case, it made me struggle a lot when I started to do data analysis in Python after working several years with R. Later, and fortunately, I began to understand how it worked and I became better at it.

However, I never felt as comfortable with pandas as I do with dplyr syntax. Maybe it’s my R background? I don’t know. I just know that one of the things I enjoy the most in R is how easy and clear is to compose data manipulation operations with dplyr and friends. And having that expressiveness in Python is fantastic.