Working with files#
Experimental data is saved in many separate files - often one file per animal or experiment. If you want to analyse the data for several animals or experiments with python, you need to:
Discover and list these data files, so you can process each of them automatically in a for loop.
Read the data from the files into python variables for later manipulation, and save the results stored in those variables back to files.
Loading and saving data to/from files#
- Text files ending in .txt or .csv
- Excel files ending in .xls or .xlsx
- Matlab files ending in .mat
- Numpy files ending in .npy or .npz
Loading data from text files#
Data in text files is often saved as a single column of data - with one value per row/line and sometimes with a column label:
Responses
0.561
0.342
0.23
0.144
Tabular data with multiple columns is saved with a specific character separating the individual columns in a row: the delimiter. Common delimiters are the comma , (csv - comma-separated values), the semicolon ;, and the tab character \t.
Time,Responses
1,0.561
2,0.342
3,0.23
4,0.144
We can use numpy or the pandas library for loading and saving data. Numpy and pandas are numerical computation packages, and come with functions for data IO.
import numpy as np
data = np.loadtxt('data/mouse1.txt')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[67], line 2
      1 import numpy as np
----> 2 data = np.loadtxt('data/mouse1.txt')
...
ValueError: could not convert string 'Responses' to float64 at row 0, column 1.
Open the file and inspect it! Can you guess the cause of the error?
This is because numpy wants to load the data into an array, but does not know how to deal with the column title in the text file (open the file in a text editor to check). We can skip the column header in the first row of the file using the skiprows keyword argument:
np_data = np.loadtxt('data/mouse1.txt', skiprows=1)
print(np_data, type(np_data))
[ 1. 2. 3. 4. 5. 56. 6.] <class 'numpy.ndarray'>
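If the columns in a file are separated by a delimiter, as in the comma-separated example above, you also need to pass the delimiter argument. A minimal sketch, assuming a hypothetical file data/mouse1.csv containing the Time,Responses table shown above:
csv_data = np.loadtxt('data/mouse1.csv', delimiter=',', skiprows=1)  # hypothetical comma-separated file with a header row
print(csv_data)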
Saving data to text files#
We can use numpy to save any numpy array to a text file:
data = [1.3, 2.2, 3.5, 4.8]
np.savetxt('test.txt', data, header='Responses') # the "header" argument is optional
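To save tabular data with multiple columns, like the Time,Responses example, you can stack the columns into a 2D array and pass a delimiter. A minimal sketch (the file name is made up):
time = [1, 2, 3, 4]
responses = [0.561, 0.342, 0.23, 0.144]
table = np.column_stack([time, responses])  # combine the two columns into a 2-column array
# comments='' stops numpy from prefixing the header line with '# '
np.savetxt('test_table.csv', table, delimiter=',', header='Time,Responses', comments='')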
Loading data from Excel files#
We can use the pandas library to load xls or xlsx files. Pandas loads data not into a list or a dictionary, but into its own data type - the DataFrame. For now, we can easily convert the data from a DataFrame to a list or a numpy array.
Caution: Depending on the format of the excel file, you may need to install an additional package (‘xlrd’) to be able to load the file.
import pandas as pd
excel_data = pd.read_excel('data/mouse_data.xls', sheet_name="Mouse1")
excel_data
|    | Responses |
|----|-----------|
| 0  | 1  |
| 1  | 2  |
| 2  | 3  |
| 3  | 4  |
| 4  | 5  |
| 5  | 6  |
| 6  | 7  |
| 7  | 8  |
| 8  | 9  |
| 9  | 10 |
| 10 | 11 |
| 11 | 12 |
| 12 | 13 |
| 13 | 14 |
| 14 | 15 |
What’s the type of excel_data?
print(f"The type of df is {type(excel_data)}.")
The type of df is <class 'pandas.core.frame.DataFrame'>.
We will not cover DataFrames here. But we can easily convert the DataFrame to a numpy array:
excel_data.to_numpy() # to a numpy array
array([[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10],
[11],
[12],
[13],
[14],
[15]])
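To get a plain python list instead, select a column and call tolist() - here using the Responses column from above:
excel_data['Responses'].tolist()  # to a list
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]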
Loading data from Matlab files#
Matlab .mat files saved with versions before 7.3 can be loaded with scipy.io.loadmat; files saved with version 7.3 or later are HDF5 files and can be opened with the h5py package. Both will open the file as a dictionary, with variable names as keys and the data as values.
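A minimal sketch using scipy, with a made-up file name:
from scipy.io import loadmat

mat_data = loadmat('data/mouse1.mat')  # hypothetical matlab file
print(mat_data.keys())  # the saved variable names (plus a few metadata entries added by matlab/scipy)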
Saving data to numpy files#
A more efficient (faster load/save, less disk space) format is the numpy file. There are two ways of saving data to numpy files.
You can save a single variable using np.save. The resulting file should have the extension .npy.
import numpy as np
time = np.array([1, 2, 3, 4])
print(time)
np.save('saved_single_variable.npy', time)
[1 2 3 4]
You can load the variable from the file using np.load. Since the file only contained a single variable, np.load will return the values of the variable directly:
loaded_npy = np.load('saved_single_variable.npy')
loaded_npy
array([1, 2, 3, 4])
You can also save multiple variables to a single file using np.savez, which will create a file with the extension .npz.
time_to_save = np.array([1, 2, 3, 4])
voltage_to_save = np.array([-60, -60, 40, -60])
np.savez('saved_multiple_variables.npz', time=time_to_save, voltage=voltage_to_save) # the keywords "time" and "voltage" will be the names associated with the saved data variables.
np.savez('saved_multiple_variables_NO.npz', time_to_save, voltage_to_save) # without keyword arguments, the variables are saved under generic names like "arr_0" and "arr_1"
Loading an npz file will return a dictionary-like data structure, with the names of the saved variables as keys and the variables’ data as values:
loaded_npz = np.load('saved_multiple_variables.npz')
print(loaded_npz)
print("time:", loaded_npz['time'])
for key, val in loaded_npz.items():
    print(f'Variable "{key}" with values "{val}".')
NpzFile 'saved_multiple_variables.npz' with keys: time, voltage
time: [1 2 3 4]
Variable "time" with values "[1 2 3 4]".
Variable "voltage" with values "[-60 -60 40 -60]".
Discovering files#
If we have data from multiple experiments or animals, we want to process everything automatically with python. To do that, we need a way to find out which files are on our computer and where they are, so we can load them.
Say we have experimental data stored in the following directory structure:
experiments
├── experiment_1
│ ├── final_behavioral_scores.txt
│ └── initial_behavioral_scores.txt
├── experiment_2
│ ├── final_behavioral_scores.txt
│ └── initial_behavioral_scores.txt
├── information.txt
└── mouse_names.txt
A bit of nomenclature - take this path: experiments/experiment_1/final_behavioral_scores.txt
- final_behavioral_scores.txt is the file name
- final_behavioral_scores is called the file stem
- .txt is called the suffix or extension
- experiments and experiment_1 are directories or folders
- experiments is the parent directory of experiment_1
- experiment_1 is a subdirectory of experiments
- / is the path separator
The glob module allows you to list directories or files in a directory. Wildcard characters allow you to find file names matching a specific pattern:
- ? matches a single character
- * matches any string of characters
from glob import glob
print(f"{glob('experiments/*')=}") # find all files and directories in the experiments directory
print(f"{glob('experiments/*.txt')=}") # find files ending in '.txt'
print(f"{glob('experiments/i*.txt')=}") # find files and directories in 'experiments', starting with 'i', and ending in '.txt'
print(f"{glob('experiments/experiment_?')=}") # find files and directories in 'experiments', starting with 'experiment_', and ending in a single unknown character.
print(f"{glob('experiments/*/')=}") # find all subdirectories
print(f"{glob('experiments/*/*.txt')=}") # find 'txt' files in all subdirectories
glob('experiments/*')=['experiments/experiment_1', 'experiments/information.txt', 'experiments/mouse_names.txt', 'experiments/experiment_2']
glob('experiments/*.txt')=['experiments/information.txt', 'experiments/mouse_names.txt']
glob('experiments/i*.txt')=['experiments/information.txt']
glob('experiments/experiment_?')=['experiments/experiment_1', 'experiments/experiment_2']
glob('experiments/*/')=['experiments/experiment_1/', 'experiments/experiment_2/']
glob('experiments/*/*.txt')=['experiments/experiment_1/initial_behavioral_scores.txt', 'experiments/experiment_1/final_behavioral_scores.txt', 'experiments/experiment_2/initial_behavioral_scores.txt', 'experiments/experiment_2/final_behavioral_scores.txt']
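This is all we need for batch processing: glob the data files and loop over them. A minimal sketch, assuming the score files contain plain numeric columns readable with np.loadtxt:
import numpy as np
from glob import glob

for path in glob('experiments/*/final_behavioral_scores.txt'):
    scores = np.loadtxt(path)  # add skiprows=1 if the files contain a column header
    print(path, scores.mean())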
Manipulating paths#
We often need to manipulate path names.
Say we want to process the data in initial_behavioral_scores.txt and final_behavioral_scores.txt for each experiment in experiments, and want to save the results in a folder called results that mimics the structure of the data folder: results/experiment_1/behavior.xls, results/experiment_2/behavior.xls.
We want to generate the paths for the new results files automatically from the paths of the data files. That means we need to manipulate the path names. In one exercise, you will do just that!
There are two ways of working with paths in python: the os.path module and the pathlib module (shown at the end of this section). First, os.path:
import os.path
my_path = 'parentdir/subdir/name.txt'
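# splitext splits the path into everything before the extension, and the extension itself: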
print(f"{os.path.splitext(my_path)=}")
my_path_parts = os.path.splitext(my_path)
trunk = my_path_parts[0]
extension = my_path_parts[1]
print(f"{my_path=}, {trunk=}, {extension=}")
# this will split off the file name from the rest of the path:
trunk, head = os.path.split(my_path)
print(f"{trunk=}, {head=}")
# we can split the trunk again to split off the sub directory
new_trunk, new_head = os.path.split(trunk)
print(f"{new_trunk=}, {new_head=}")
print(f"{os.path.basename(my_path)=}") # returns the filename
print(f"{os.path.dirname(my_path)=}") # returns the directories
os.path.splitext(my_path)=('parentdir/subdir/name', '.txt')
my_path='parentdir/subdir/name.txt', trunk='parentdir/subdir/name', extension='.txt'
trunk='parentdir/subdir', head='name.txt'
new_trunk='parentdir', new_head='subdir'
os.path.basename(my_path)='name.txt'
os.path.dirname(my_path)='parentdir/subdir'
We now know how to split a path in various ways. How can we assemble the parts into something new?
We can use os.path.join with multiple strings as arguments to join them into a new path. This function will use the correct path separator.
new_path = os.path.join('directory_name', 'subdirectory_name', 'file_name.ext')
print(new_path)
directory_name/subdirectory_name/file_name.ext
What if we want to change the file extension? Say our data file is an excel file (extension ‘.xlsx’) and we want to save the results from analyzing that file to a text file (extension ‘.txt’).
Since the parts are strings, we can use the + operator to concatenate them. This will create a file name with a new suffix:
data_file_name = 'mouse_276.xls'
print('Load data from', data_file_name)
# First split off the old extension:
data_file_trunk, data_file_ext = os.path.splitext(data_file_name)
# Then add the new extension to the file trunk
results_file_name = data_file_trunk + '.txt'
print('Save result to', results_file_name)
Load data from mouse_276.xls
Save result to mouse_276.txt
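The second way is the pathlib module from the standard library, which wraps the same operations in a Path object:
from pathlib import Path

my_path = Path('parentdir/subdir/name.txt')
print(my_path.name)    # the file name
print(my_path.stem)    # the file stem
print(my_path.suffix)  # the extension
print(my_path.parent)  # the directories
# the / operator joins path parts, and with_suffix swaps the extension:
new_path = Path('results') / 'experiment_1' / my_path.name
print(new_path.with_suffix('.xls'))
name.txt
name
.txt
parentdir/subdir
results/experiment_1/name.xls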
Creating directories#
To save files to new directories, we can create them directly with python:
os.makedirs('tmp/sub1/sub2', exist_ok=True)
exist_ok=True prevents an error if the directory already exists.
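A typical pattern when saving results, sketched here with made-up names: derive the directory from the results file path and create it before saving:
import os
import numpy as np

results_path = 'results/experiment_1/behavior.txt'  # hypothetical results file
os.makedirs(os.path.dirname(results_path), exist_ok=True)  # create 'results/experiment_1' if it does not exist yet
np.savetxt(results_path, [0.5, 0.3])  # hypothetical results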