Working with files#
Experimental data is saved in many separate files - often one file per animal or experiment. If you want to analyse the data for several animals or experiments with python, you need to:
Discover and list these data files, so you can process each of them automatically in a for loop.
Read the data from the files into python variables for later manipulation, and save the results stored in those variables back to files.
Loading and saving data to/from files#
- Text files ending in .txt or .csv
- Excel files ending in .xls or .xlsx
- Matlab files ending in .mat
- Numpy files ending in .npy or .npz
Loading data from text files#
Data in text files is often saved as a single column of data - with one value per row/line and sometimes with a column label:
Responses
0.561
0.342
0.23
0.144
Tabular data with multiple columns is saved with a specific character separating the individual columns in a row: the delimiter. Common delimiters are the comma , (csv - comma-separated values), the semicolon ;, and the tab character \t.
Time,Responses
1,0.561
2,0.342
3,0.23
4,0.144
We can use numpy or the pandas library for loading and saving data. Numpy and pandas are numerical computation packages, and come with functions for data IO.
import numpy as np
data = np.loadtxt('data/mouse1.txt')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[67], line 2
      1 import numpy as np
----> 2 data = np.loadtxt('data/mouse1.txt')
...
ValueError: could not convert string 'Responses' to float64 at row 0, column 1.
Open the file and inspect it! Can you guess the cause of the error?
This is because numpy wants to load the data into an array, but does not know how to deal with the column title in the text file (open the file in a text editor to check). We can skip the column header in the first row of the file using the skiprows keyword argument:
np_data = np.loadtxt('data/mouse1.txt', skiprows=1)
print(np_data, type(np_data))
[ 1. 2. 3. 4. 5. 56. 6.] <class 'numpy.ndarray'>
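If the columns in a file are separated by a delimiter, as in the comma-separated example above, you also need to pass the delimiter argument. A minimal sketch, assuming a hypothetical file data/mouse1.csv containing the Time,Responses table shown above:
csv_data = np.loadtxt('data/mouse1.csv', delimiter=',', skiprows=1)  # hypothetical comma-separated file with a header row
print(csv_data)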
Saving data to text files#
We can use numpy to save any numpy array to a text file:
data = [1.3, 2.2, 3.5, 4.8]
np.savetxt('test.txt', data, header='Responses') # the "header" argument is optional
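To save tabular data with multiple columns, like the Time,Responses example, you can stack the columns into a 2D array and pass a delimiter. A minimal sketch (the file name is made up):
time = [1, 2, 3, 4]
responses = [0.561, 0.342, 0.23, 0.144]
table = np.column_stack([time, responses])  # combine the two columns into a 2-column array
# comments='' stops numpy from prefixing the header line with '# '
np.savetxt('test_table.csv', table, delimiter=',', header='Time,Responses', comments='')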
Loading data from Excel files#
We can use the pandas library to load xls or xlsx files. Pandas loads data not into a list or a dictionary, but into its own data type - the DataFrame. For now, we can easily convert the data from a DataFrame to a list or a numpy array.
Caution: Depending on the format of the excel file, you may need to install an additional package (‘xlrd’) to be able to load the file.
import pandas as pd
excel_data = pd.read_excel('data/mouse_data.xls', sheet_name="Mouse1")
excel_data
|    | Responses |
|----|-----------|
| 0  | 1  |
| 1  | 2  |
| 2  | 3  |
| 3  | 4  |
| 4  | 5  |
| 5  | 6  |
| 6  | 7  |
| 7  | 8  |
| 8  | 9  |
| 9  | 10 |
| 10 | 11 |
| 11 | 12 |
| 12 | 13 |
| 13 | 14 |
| 14 | 15 |
What’s the type of excel_data?
print(f"The type of df is {type(excel_data)}.")
The type of df is <class 'pandas.core.frame.DataFrame'>.
We will not cover DataFrames here. But we can easily convert the DataFrame to a numpy array:
excel_data.to_numpy() # to a numpy array
array([[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10],
[11],
[12],
[13],
[14],
[15]])
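To get a plain python list instead, select a column and call tolist() - here using the Responses column from above:
excel_data['Responses'].tolist()  # to a list
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]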
Loading data from Matlab files#
Matlab .mat files saved with versions before 7.3 can be loaded with scipy.io.loadmat; files saved with version 7.3 or later are HDF5 files and can be opened with the h5py package. Both will open the file as a dictionary, with variable names as keys and the data as values.
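A minimal sketch using scipy, with a made-up file name:
from scipy.io import loadmat

mat_data = loadmat('data/mouse1.mat')  # hypothetical matlab file
print(mat_data.keys())  # the saved variable names (plus a few metadata entries added by matlab/scipy)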
Saving data to numpy files#
A more efficient (faster load/save, less disk space) format is the numpy file. There are two ways of saving data to numpy files.
You can save a single variable using np.save. The resulting file should have the extension .npy.
import numpy as np
time = np.array([1, 2, 3, 4])
print(time)
np.save('saved_single_variable.npy', time)
[1 2 3 4]
You can load the variable from the file using np.load. Since the file only contained a single variable, np.load will return the values of the variable directly:
loaded_npy = np.load('saved_single_variable.npy')
loaded_npy
array([1, 2, 3, 4])
You can also save multiple variables to a single file using np.savez, which will create a file with the extension .npz.
time_to_save = np.array([1, 2, 3, 4])
voltage_to_save = np.array([-60, -60, 40, -60])
np.savez('saved_multiple_variables.npz', time=time_to_save, voltage=voltage_to_save) # the keywords "time" and "voltage" will be the names associated with the saved data variables.
np.savez('saved_multiple_variables_NO.npz', time_to_save, voltage_to_save) # without keyword arguments, the variables are saved under generic names like "arr_0" and "arr_1"
Loading an npz file will return a dictionary-like data structure, with the names of the saved variables as keys and the variables’ data as values:
loaded_npz = np.load('saved_multiple_variables.npz')
print(loaded_npz)
print("time:", loaded_npz['time'])
for key, val in loaded_npz.items():
    print(f'Variable "{key}" with values "{val}".')
NpzFile 'saved_multiple_variables.npz' with keys: time, voltage
time: [1 2 3 4]
Variable "time" with values "[1 2 3 4]".
Variable "voltage" with values "[-60 -60 40 -60]".
Discovering files#
If we have data from multiple experiments or animals, we want to process everything automatically with python. To do that, we need a way to find out which files are on our computer and where they are, so we can load them.
Say we have experimental data stored in the following directory structure:
experiments
├── experiment_1
│ ├── final_behavioral_scores.txt
│ └── initial_behavioral_scores.txt
├── experiment_2
│ ├── final_behavioral_scores.txt
│ └── initial_behavioral_scores.txt
├── information.txt
└── mouse_names.txt
A bit of nomenclature - take this path: experiments/experiment_1/final_behavioral_scores.txt
- final_behavioral_scores.txt is the file name
- final_behavioral_scores is called the file stem
- .txt is called the suffix or extension
- experiments and experiment_1 are directories or folders
- experiments is the parent directory of experiment_1
- experiment_1 is a subdirectory of experiments
- / is the path separator
The glob module allows you to list directories or files in a directory. Wildcard characters allow you to find file names matching a specific pattern:
- ? matches a single character
- * matches any string of characters
from glob import glob
print(f"{glob('experiments/*')=}") # find all files and directories in the experiments directory
print(f"{glob('experiments/*.txt')=}") # find files ending in '.txt'
print(f"{glob('experiments/i*.txt')=}") # find files and directories in 'experiments', starting with 'i', and ending in '.txt'
print(f"{glob('experiments/experiment_?')=}") # find files and directories in 'experiments', starting with 'experiment_', and ending in a single unknown character.
print(f"{glob('experiments/*/')=}") # find all subdirectories
print(f"{glob('experiments/*/*.txt')=}") # find 'txt' files in all subdirectories
glob('experiments/*')=['experiments/experiment_1', 'experiments/information.txt', 'experiments/mouse_names.txt', 'experiments/experiment_2']
glob('experiments/*.txt')=['experiments/information.txt', 'experiments/mouse_names.txt']
glob('experiments/i*.txt')=['experiments/information.txt']
glob('experiments/experiment_?')=['experiments/experiment_1', 'experiments/experiment_2']
glob('experiments/*/')=['experiments/experiment_1/', 'experiments/experiment_2/']
glob('experiments/*/*.txt')=['experiments/experiment_1/initial_behavioral_scores.txt', 'experiments/experiment_1/final_behavioral_scores.txt', 'experiments/experiment_2/initial_behavioral_scores.txt', 'experiments/experiment_2/final_behavioral_scores.txt']
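This is all we need for batch processing: glob the data files and loop over them. A minimal sketch, assuming the score files contain plain numeric columns readable with np.loadtxt:
import numpy as np
from glob import glob

for path in glob('experiments/*/final_behavioral_scores.txt'):
    scores = np.loadtxt(path)  # add skiprows=1 if the files contain a column header
    print(path, scores.mean())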
Manipulating paths#
We often need to manipulate path names.
Say we want to process the data in initial_behavioral_scores.txt and final_behavioral_scores.txt for each experiment in experiments, and want to save the results in a folder called results that mimics the structure of the data folder: results/experiment_1/behavior.xls, results/experiment_2/behavior.xls.
We want to generate the paths for the new results files automatically from the paths of the data files. That means we need to manipulate the path names. In one exercise, you will do just that!
There are two ways of working with paths in python: the os.path module and the pathlib module (shown at the end of this section). First, os.path:
import os.path
my_path = 'parentdir/subdir/name.txt'
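# splitext splits the path into everything before the extension, and the extension itself: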
print(f"{os.path.splitext(my_path)=}")
my_path_parts = os.path.splitext(my_path)
trunk = my_path_parts[0]
extension = my_path_parts[1]
print(f"{my_path=}, {trunk=}, {extension=}")
# this will split off the file name from the rest of the path:
trunk, head = os.path.split(my_path)
print(f"{trunk=}, {head=}")
# we can split the trunk again to split off the sub directory
new_trunk, new_head = os.path.split(trunk)
print(f"{new_trunk=}, {new_head=}")
print(f"{os.path.basename(my_path)=}") # returns the filename
print(f"{os.path.dirname(my_path)=}") # returns the directories
os.path.splitext(my_path)=('parentdir/subdir/name', '.txt')
my_path='parentdir/subdir/name.txt', trunk='parentdir/subdir/name', extension='.txt'
trunk='parentdir/subdir', head='name.txt'
new_trunk='parentdir', new_head='subdir'
os.path.basename(my_path)='name.txt'
os.path.dirname(my_path)='parentdir/subdir'
We now know how to split a path in various ways. How can we assemble the parts into something new?
We can use os.path.join with multiple strings as arguments to join them into a new path. This function will use the correct path separator.
new_path = os.path.join('directory_name', 'subdirectory_name', 'file_name.ext')
print(new_path)
directory_name/subdirectory_name/file_name.ext
What if we want to change the file extension? Say our data file is an excel file (extension ‘.xlsx’) and we want to save the results from analyzing that file to a text file (extension ‘.txt’).
Since the parts are strings, we can use the + operator to concatenate them. This will create a file name with a new suffix:
data_file_name = 'mouse_276.xls'
print('Load data from', data_file_name)
# First split off the old extension:
data_file_trunk, data_file_ext = os.path.splitext(data_file_name)
# Then add the new extension to the file trunk
results_file_name = data_file_trunk + '.txt'
print('Save result to', results_file_name)
Load data from mouse_276.xls
Save result to mouse_276.txt
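The second way is the pathlib module from the standard library, which wraps the same operations in a Path object:
from pathlib import Path

my_path = Path('parentdir/subdir/name.txt')
print(my_path.name)    # the file name
print(my_path.stem)    # the file stem
print(my_path.suffix)  # the extension
print(my_path.parent)  # the directories
# the / operator joins path parts, and with_suffix swaps the extension:
new_path = Path('results') / 'experiment_1' / my_path.name
print(new_path.with_suffix('.xls'))
name.txt
name
.txt
parentdir/subdir
results/experiment_1/name.xls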
Creating directories#
To save files to new directories, we can create them directly with python:
os.makedirs('tmp/sub1/sub2', exist_ok=True)
exist_ok=True prevents an error if the directory already exists.
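A typical pattern when saving results, sketched here with made-up names: derive the directory from the results file path and create it before saving:
import os
import numpy as np

results_path = 'results/experiment_1/behavior.txt'  # hypothetical results file
os.makedirs(os.path.dirname(results_path), exist_ok=True)  # create 'results/experiment_1' if it does not exist yet
np.savetxt(results_path, [0.5, 0.3])  # hypothetical results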