Working with tabular data: Pandas

Working with tabular data: Pandas#

Pandas provides three things:

a new data type specifically designed for tabular data - the DataFrame
functions for manipulating tabular data
IO for tabular data (covered in Week3/Files)

Why yet another data type?#

Numpy arrays can hold tabular data - 2D matrices. So why do we need another, special data type? One problem with raw 2D arrays is that they are not self-documenting. What does that mean?

Take the data from exercise of Week3/Files: We calculated behavioral scores, stored them in a numpy array, and saved the array to a text file:

import numpy as np
# load and print matrix
scores = np.loadtxt('dat/pd_scores.txt')
scores

array([[0.08656877, 0.26186048],
       [0.27081788, 0.48948194],
       [0.47072199, 0.65700287],
       [0.55996639, 0.81680243],
       [0.84880959, 0.98686545],
       [0.91045131, 0.99465925]])

Do you remember what the data means? What are the individual rows? What are the columns? Imagine this were an analysis on your own data and you’d want to look at the results in 3 months or so.

There are no labels, so it’s hard to know from the data itself what they mean - the data is not “self-documenting”

We can turn this into a DataFrame:

import pandas as pd

# the folder names correspond to the days
days = ['20230912','20230913','20230914','20230915','20230916','20230917']

# make DataFrame - no need to understand the specifics - I will explain soon what is going on here
# we can easily turn a dictionary into a DataFrame - keys will be column labels, values are the per-row data
# we can also specify row labels - an index.
scores_dict = {'intial score': scores[:, 0], 'final score': scores[:, 1]}
df = pd.DataFrame(scores_dict, index=days)
df

	intial score	final score
20230912	0.086569	0.261860
20230913	0.270818	0.489482
20230914	0.470722	0.657003
20230915	0.559966	0.816802
20230916	0.848810	0.986865
20230917	0.910451	0.994659

You can save the data frame as csv and excel, and reload them as a DataFrame, with the labels being preserved:

df.to_csv('dat/scores_df.csv')
df.to_excel('dat/scores_df.xlsx')

# load data
df_from_file = pd.read_csv('dat/scores_df.csv', index_col=0, )
df_from_file

	Unnamed: 0	final score
intial score
0.086569	20230912	0.261860
0.270818	20230913	0.489482
0.470722	20230914	0.657003
0.559966	20230915	0.816802
0.848810	20230916	0.986865
0.910451	20230917	0.994659

Creating DataFrames#

We already learned that pandas provides versatile file I/O (csv, excel): https://pandas.pydata.org/docs/_sources/user_guide/io.html (see Week3/Files)

As shown above, a DataFrame can be created from Dictionary.

Data is 2D and is organized in columns and rows (rows=index). If we do not specify an index during DataFrame creation, it will be generated automatically, as a row numbers:

d = {'column 1': ['a','b','c'], 'column 2': [10, 20, 30]}

print(d)
df = pd.DataFrame(d)
df

{'column 1': ['a', 'b', 'c'], 'column 2': [10, 20, 30]}

	column 1	column 2
0	a	10
1	b	20
2	c	30

A DataFrame can also be created from a 2D np.array. In that case, we can specify the column names separately:

# scores
days

['20230912', '20230913', '20230914', '20230915', '20230916', '20230917']

df = pd.DataFrame(scores, columns=['initial', 'final'], index=days)
df

	initial	final
20230912	0.086569	0.261860
20230913	0.270818	0.489482
20230914	0.470722	0.657003
20230915	0.559966	0.816802
20230916	0.848810	0.986865
20230917	0.910451	0.994659

Accessing data in DataFrames#

There are many ways of accessing data in a DataFrame. We cover the basics here.

More details: https://pandas.pydata.org/docs/_sources/user_guide/indexing.html

Access columns like a dictionary#

df['final']  # df[column_name]

20230912    0.261860
20230913    0.489482
20230914    0.657003
20230915    0.816802
20230916    0.986865
20230917    0.994659
Name: final, dtype: float64

Access rows and columns by name like a 2D dictionary via df.loc#

df.loc[index_name, column_name]

df = pd.DataFrame(scores, columns=['initial', 'final'], index=days)
display(df)

df.loc[:, 'initial']


# print("by row name - all columns: df.loc['20230915']")
# print(df.loc['20230915'])

# print("this is equivalent to: df.loc['20230915', :]")
# print(df.loc['20230915', :])

# print("by column name - all rows, one column: df.loc[:, 'initial']")
# print(df.loc[:, 'initial'])

# print("\nby row and column name - returns a single cell in the table: df.loc['20230915', 'initial']")
# print(df.loc['20230915', 'initial'])

	initial	final
20230912	0.086569	0.261860
20230913	0.270818	0.489482
20230914	0.470722	0.657003
20230915	0.559966	0.816802
20230916	0.848810	0.986865
20230917	0.910451	0.994659

20230912    0.086569
20230913    0.270818
20230914    0.470722
20230915    0.559966
20230916    0.848810
20230917    0.910451
Name: initial, dtype: float64

Access to rows and columns via a numerical index like a 2D array via df.iloc#

df.iloc[row_number, column_number]

print(df.iloc[:, 1])
print(df.iloc[1])
print(df.iloc[1, :])
print(df.iloc[2, 1])
print(df.iloc[:2, 1])  # slicing also works

20230912    0.261860
20230913    0.489482
20230914    0.657003
20230915    0.816802
20230916    0.986865
20230917    0.994659
Name: final, dtype: float64
initial    0.270818
final      0.489482
Name: 20230913, dtype: float64
initial    0.270818
final      0.489482
Name: 20230913, dtype: float64
0.6570028706902653
20230912    0.261860
20230913    0.489482
Name: final, dtype: float64

Mini-Exercise:#

Get from the DataFrame below:

all data from the ‘test’ column
all data from the ‘…’ row
data from the cell at row ‘…’ and column ‘…’
the data from the 5th row

df = pd.DataFrame(...)

Access the underlying data like a numpy array:#

data = df.values
data, type(data)

(array([[0.08656877, 0.26186048],
        [0.27081788, 0.48948194],
        [0.47072199, 0.65700287],
        [0.55996639, 0.81680243],
        [0.84880959, 0.98686545],
        [0.91045131, 0.99465925]]),
 numpy.ndarray)

Boolean indexing also works#

df[df['initial'] < 0.4]

	initial	final
20230912	0.086569	0.261860
20230913	0.270818	0.489482

You can apply boolean indices repeatedly, to filter by different conditions:

df = pd.DataFrame({'name': ['Tim', 'Jim', 'Pim', 'Pip', 'Tom'],
                   'age': [12, 24, 18, 22, 26],
                   'score': [4, 0, 8, 9, 7]})
display(df)

# first, keep only the old one - age>20 yrs
old_guys = df[df['age']> 20]
display(old_guys)

# then, filter the dataframe with the old ones, to keep only the high scores - score>4
high_scorer = old_guys[old_guys['score']> 4]
display(high_scorer)

	name	age	score
0	Tim	12	4
1	Jim	24	0
2	Pim	18	8
3	Pip	22	9
4	Tom	26	7

	name	age	score
1	Jim	24	0
3	Pip	22	9
4	Tom	26	7

	name	age	score
3	Pip	22	9
4	Tom	26	7

Groupby: Split-apply-combine#

A common problem in data analysis is:

We have an experiment with multiple subjects, and multiple measurements per subject. We’d like to compute the average score for each subject for plotting and statistics.

For instance:

# REMOVE - make dummy data for grouped stats
import matplotlib.pyplot as plt
nb_animals = 10
nb_trials = 6

scores = np.zeros((nb_animals * nb_trials, 3))
for ani, animal in enumerate(range(nb_animals)):
    scores[(ani) * nb_trials:(ani + 1) * nb_trials, 1:] = 2+np.random.randn(nb_trials, 2)/2 + np.array([0.3, 0.8]) + np.random.randn(1) / 2
    scores[(ani) * nb_trials:(ani + 1) * nb_trials, 0] = animal

plt.plot(scores)

df = pd.DataFrame({'animal': scores[:, 0], 'pre': scores[:,1], 'post': scores[:,2]})
df.to_csv('dat/groupby.csv', index=None)

../../_images/f6acdc24485263796875ab9a9be96d886f2d71d3ca26c111ae9127ae6e3bb22d.png

df = pd.read_csv('dat/pd_groupby.csv')
df

	animal	pre	post
0	0.0	2.211986	3.107796
1	0.0	2.204871	2.830730
2	0.0	2.560633	3.483980
3	0.0	1.476412	2.801709
4	0.0	3.662519	3.447005
5	0.0	2.538696	3.301515
6	1.0	1.867229	3.484270
7	1.0	3.306150	3.077195
8	1.0	2.741388	2.586878
9	1.0	1.874149	3.264328
10	1.0	3.276006	2.503612
11	1.0	3.363795	3.175878
12	2.0	3.645122	3.871356
13	2.0	4.419345	3.223414
14	2.0	3.157167	2.851897
15	2.0	3.075222	3.591049
16	2.0	3.085689	4.618074
17	2.0	2.799818	3.586372
18	3.0	2.717420	3.483265
19	3.0	1.267145	2.402146
20	3.0	3.497071	2.818079
21	3.0	1.207408	2.662157
22	3.0	2.392043	2.872839
23	3.0	2.845028	2.683747
24	4.0	1.963877	1.633267
25	4.0	2.557404	1.549594
26	4.0	1.115516	2.964476
27	4.0	1.830395	1.952609
28	4.0	2.170095	2.525734
29	4.0	2.586667	2.240448
30	5.0	2.045121	3.573011
31	5.0	2.054961	3.010360
32	5.0	2.021420	3.538843
33	5.0	1.626790	2.702867
34	5.0	2.342304	2.389883
35	5.0	1.800452	3.311980
36	6.0	0.641648	1.322786
37	6.0	2.275206	0.707293
38	6.0	1.090543	1.291870
39	6.0	1.289345	1.976258
40	6.0	1.828772	1.341221
41	6.0	0.982321	2.676962
42	7.0	3.179301	4.185368
43	7.0	3.756986	3.681570
44	7.0	2.070850	3.249996
45	7.0	2.405634	3.025023
46	7.0	3.773297	3.787543
47	7.0	3.351603	3.171412
48	8.0	2.075409	3.820043
49	8.0	1.989337	2.970652
50	8.0	2.469318	2.046087
51	8.0	2.054879	2.514335
52	8.0	1.448087	2.380560
53	8.0	2.552318	3.016450
54	9.0	2.369584	1.889417
55	9.0	3.086123	2.560759
56	9.0	2.097187	2.446292
57	9.0	1.857415	0.800996
58	9.0	1.498517	1.811201
59	9.0	0.719002	2.691326

We can solve this using for loops and boolean indexing:

# identify all animals
animals = df['animal'].unique()

# example of boolean indexing
display(df[df['animal']==0.0])
# compute the mean:
display(df[df['animal']==0.0].mean())
display(np.mean(df[df['animal']==0.0], axis=0))

# loop over all animals
animal_avg = []
for animal in animals:
    animal_avg.append(df[df['animal']==animal].mean().values)
animal_avg

	pre	post
0	2.211986	3.107796
1	2.204871	2.830730
2	2.560633	3.483980
3	1.476412	2.801709
4	3.662519	3.447005
5	2.538696	3.301515

animal    0.000000
pre       2.442519
post      3.162122
dtype: float64

animal    0.000000
pre       2.442519
post      3.162122
dtype: float64

[array([0.        , 2.44251943, 3.16212242]),
 array([1.        , 2.73811958, 3.01536034]),
 array([2.        , 3.36372726, 3.62369364]),
 array([3.        , 2.32101921, 2.82037202]),
 array([4.        , 2.03732544, 2.14435453]),
 array([5.        , 1.98184125, 3.0878241 ]),
 array([6.        , 1.35130571, 1.55273191]),
 array([7.        , 3.08961177, 3.51681884]),
 array([8.        , 2.09822498, 2.79135462]),
 array([9.        , 1.9379714 , 2.03333186])]

This series of steps is so common, that it has been implemented in pandas: https://pandas.pydata.org/docs/_sources/user_guide/groupby.html

It is applied in two steps: groupby and aggregate

groupby groups the rows that have the same value in a specific column together
aggregate applies a computation that aggregates all data in a group to a single number, like the mean, standard deviation, or max.

Here is an example of this in action. df.groupby(column_name) will produce a new object, that groups all rows with the same value in the specified column into a “virtual subtable”:

grouped = df.groupby('animal')
print(type(grouped))
grouped.get_group(0)  # get the subtable for all rows where animal==0.0

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

	pre	post
0	2.211986	3.107796
1	2.204871	2.830730
2	2.560633	3.483980
3	1.476412	2.801709
4	3.662519	3.447005
5	2.538696	3.301515

Apply the “mean” operation to all values in each subtable and make a new table with the aggregate per-group data:

agg = grouped.aggregate(np.mean)
agg

/var/folders/zr/6ql4dzjx0tq8mpzht_2dwh480000gn/T/ipykernel_3584/1479352327.py:1: FutureWarning: The provided callable <function mean at 0x1301a4a40> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  agg = grouped.aggregate(np.mean)

	pre	post
animal
0.0	2.442519	3.162122
1.0	2.738120	3.015360
2.0	3.363727	3.623694
3.0	2.321019	2.820372
4.0	2.037325	2.144355
5.0	1.981841	3.087824
6.0	1.351306	1.552732
7.0	3.089612	3.516819
8.0	2.098225	2.791355
9.0	1.937971	2.033332

Plot the result:

import matplotlib.pyplot as plt

x = np.zeros_like(agg.values).T + [[0], [1]]
y = agg.values.T

plt.figure(figsize=(3, 6))
plt.plot(x, y, 'o-k')
plt.xticks([0, 1], labels=agg.columns)
plt.ylabel('Score')
plt.show()

../../_images/2ebcf516b792bf9aaa83cae06004042bc6c3e461b470313de10ffbf75feffa98.png

Example: Connectomics#

The connectome of the (female) fly brain was recently published.

It can be accessed usign a beautiful web interface and user friendly web site: https://codex.flywire.ai.

The web site is great for exploring the fly brain, but clicking around in a web browser has limitations in terms of the reproducibility and scale of an analysis.

Therefore, the data was made availabel as a set of tables that can be loaded and processed with pandas.

In the exercise, we will use pandas to answer neuroscience questions about the fly brain - here is a brief introduction to the data:

There are three main tables:

connections - connections between cells
classification - cell classes (afferent, intrinsic, efferent), …
(labels - human annotations of cell types)

Each neuron has a unique identifier, the root_id. The root_id can be used to track neurons across the different tables. For instance, to find all neurons of a specific types, and then find their inputs or outputs.

Classifications and connections#

The classification table has basic information about each of the 130k neurons in the fly brain: the cell_type, side of the brain, position in the brain etc.

classification = pd.read_csv('dat/flywire/classification.csv.gz')
classification

	root_id	flow	super_class	class	sub_class	cell_type	hemibrain_type	hemilineage	side	nerve
0	720575940627005443	intrinsic	optic	L1-5	NaN	L4	NaN	NaN	right	NaN
1	720575940615995398	intrinsic	optic	L1-5	NaN	L4	NaN	NaN	right	NaN
2	720575940621762567	afferent	ascending	AN	NaN	NaN	NaN	NaN	right	CV
3	720575940624384007	afferent	sensory	olfactory	NaN	NaN	ORN_VA1v	NaN	left	AN
4	720575940614422540	intrinsic	central	NaN	NaN	NaN	AOTU032,AOTU034	LALa1_posterior	right	NaN
...	...	...	...	...	...	...	...	...	...	...
127974	720575940621500404	intrinsic	visual_projection	NaN	NaN	NaN	LLPC1,LLPC2a,LLPC2b,LLPC2c,LLPC2d,LLPC3	NaN	right	NaN
127975	720575940651646966	intrinsic	central	NaN	NaN	NaN	NaN	LB0_anterior	center	NaN
127976	720575940630151162	intrinsic	optic	optic_lobes	NaN	NaN	NaN	NaN	left	NaN
127977	720575940629364731	intrinsic	optic	optic_lobes	NaN	NaN	NaN	NaN	right	NaN
127978	720575940623859711	afferent	sensory	visual	NaN	R1-6	NaN	NaN	right	NaN

127979 rows × 10 columns

The connections table contains information about all ~4M synaptic synaptic connections:

pre_root_id: root_id of the presynaptic neuron, the source
post_root_id: root_id of the postsynaptic neuron, the target
syn_count: number of synapses in connection
nt_type: predicted neurotransmitter neurotransmitter
neuropil: target neuropil

connections = pd.read_csv('dat/flywire/connections.csv.gz')
connections

	pre_root_id	post_root_id	neuropil	syn_count	nt_type
0	720575940596125868	720575940608552405	LOP_R	5	ACH
1	720575940596125868	720575940611348834	LOP_R	7	ACH
2	720575940596125868	720575940613059993	LOP_R	5	GLUT
3	720575940596125868	720575940616986553	LOP_R	5	ACH
4	720575940596125868	720575940620124326	LOP_R	8	ACH
...	...	...	...	...	...
3794610	720575940660868737	720575940607206786	ME_L	9	GABA
3794611	720575940660868737	720575940608664873	ME_L	6	GABA
3794612	720575940660868737	720575940611462242	ME_L	6	GABA
3794613	720575940660868737	720575940622913063	ME_L	23	GABA
3794614	720575940660868737	720575940626553546	ME_L	6	ACH

3794615 rows × 5 columns

Example: Find neurons of a specific cell type#

LC10a is a cluster of higher-order visual neurons required by Drosophila males to track the female during courtship.

We can find all LC10a neurons in the fly brain using boolean indexing into the classification table:

LC10a_neurons = classification[classification['cell_type']=='LC10a']
LC10a_neurons

	root_id	flow	super_class	class	sub_class	cell_type	hemibrain_type	hemilineage	side	nerve
3035	720575940620720112	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	right	NaN
3691	720575940638285109	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	left	NaN
3737	720575940614430111	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	right	NaN
4456	720575940623082237	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	right	NaN
4544	720575940623868853	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	left	NaN
...	...	...	...	...	...	...	...	...	...	...
125773	720575940631195217	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	left	NaN
126409	720575940631982929	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	left	NaN
126528	720575940631983185	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	left	NaN
126843	720575940627265223	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	left	NaN
127467	720575940610227162	intrinsic	visual_projection	NaN	NaN	LC10a	LC10	VPNd2	left	NaN

227 rows × 10 columns

Example: Find all postsynaptic partners of a given neuron#

# select the first LC10a neuron in the list and get it's root_id
pre_root_id = LC10a_neurons.iloc[0]['root_id']
print(pre_root_id)

# find all downstream targets for that neuron
# that is, all synaptic connections for which our neuron's root_id is the root_id of the pre-synapse
targets = connections[connections['pre_root_id']==pre_root_id]
targets

720575940620720112

	pre_root_id	post_root_id	neuropil	syn_count	nt_type
1309977	720575940620720112	720575940610389873	AOTU_R	10	ACH
1309978	720575940620720112	720575940613946783	AOTU_R	10	ACH
1309979	720575940620720112	720575940614041238	AOTU_R	9	ACH
1309980	720575940620720112	720575940616012061	AOTU_R	26	ACH
1309981	720575940620720112	720575940616169629	AOTU_R	14	ACH
1309982	720575940620720112	720575940616888020	AOTU_R	7	ACH
1309983	720575940620720112	720575940619453861	AOTU_R	5	ACH
1309984	720575940620720112	720575940620321158	AOTU_R	11	ACH
1309985	720575940620720112	720575940621925631	AOTU_R	7	ACH
1309986	720575940620720112	720575940622538520	AOTU_R	24	ACH
1309987	720575940620720112	720575940623107509	AOTU_R	5	ACH
1309988	720575940620720112	720575940626024336	AOTU_R	5	ACH
1309989	720575940620720112	720575940626643658	AOTU_R	6	ACH
1309990	720575940620720112	720575940627738640	AOTU_R	22	ACH
1309991	720575940620720112	720575940628916444	AOTU_R	13	ACH
1309992	720575940620720112	720575940628972538	AOTU_R	8	ACH
1309993	720575940620720112	720575940629883971	AOTU_R	17	ACH
1309994	720575940620720112	720575940631517251	AOTU_R	31	ACH
1309995	720575940620720112	720575940638668659	AOTU_R	11	ACH
1309996	720575940620720112	720575940652922529	AOTU_R	11	ACH