This file was created from the following Jupyter-notebook: docs/importdata.ipynb

Importing Data Example

In order to import data into pypillometry, we have to load the data from its source using other packages and then wrap it in a PupilData object.

Here we will show an example where we translate a file recorded in EyeLink's EDF format into a file readable by pandas.read_table().

First, we import the needed modules.

[2]:
import sys, os
sys.path.insert(0,"..") # this is not needed if you have installed pypillometry
import pypillometry as pp
import pandas as pd
import numpy as np
import pylab as plt

In this example, we use data recorded with an EyeLink eye-tracker. These eye-trackers store their recordings in binary files with the extension .edf. Some information about this file format is available from SR Research. We use a command-line utility released by EyeLink to convert this proprietary format into a more easily readable .asc file, which is a whitespace-separated plain-text format. The converter, edf2asc, can be downloaded for different platforms from the EyeLink support forum; there is a GUI-based program for Windows and command-line programs for Linux and macOS. Binaries of the command-line tools for Linux and macOS are included in the pypillometry repository (in the external/ directory, as used below).

We call this converter twice on an example EDF file: once with -s to extract the samples, and once with -e to extract the events (here using the macOS binary).

[1]:
!../external/edf2asc-mac -y -s ../data/test.edf ../data/test_samples.asc
!../external/edf2asc-mac -y -e ../data/test.edf ../data/test_events.asc

EDF2ASC: EyeLink EDF file -> ASCII (text) file translator
EDF2ASC version 3.1 MacOS X Jul 13 2010
(c)1995-2009 by SR Research, last modified Jul 13 2010

processing file ../data/test.edf
=======================Preamble of file ../data/test.edf=======================
| DATE: Fri Feb 14 08:48:33 2020                                              |
| TYPE: EDF_FILE BINARY EVENT SAMPLE TAGGED                                   |
| VERSION: EYELINK II 1                                                       |
| SOURCE: EYELINK CL                                                          |
| EYELINK II CL v6.12 Feb  1 2018 (EyeLink Portable Duo)                      |
| CAMERA: EyeLink USBCAM Version 1.01                                         |
| SERIAL NUMBER: CLU-DAC49                                                    |
| CAMERA_CONFIG: DAC49200.SCD                                                 |
| Psychopy GC demo                                                            |
===============================================================================

Converted successfully: 0 events, 1245363 samples, 6 blocks.

EDF2ASC: EyeLink EDF file -> ASCII (text) file translator
EDF2ASC version 3.1 MacOS X Jul 13 2010
(c)1995-2009 by SR Research, last modified Jul 13 2010

processing file ../data/test.edf
=======================Preamble of file ../data/test.edf=======================
| DATE: Fri Feb 14 08:48:33 2020                                              |
| TYPE: EDF_FILE BINARY EVENT SAMPLE TAGGED                                   |
| VERSION: EYELINK II 1                                                       |
| SOURCE: EYELINK CL                                                          |
| EYELINK II CL v6.12 Feb  1 2018 (EyeLink Portable Duo)                      |
| CAMERA: EyeLink USBCAM Version 1.01                                         |
| SERIAL NUMBER: CLU-DAC49                                                    |
| CAMERA_CONFIG: DAC49200.SCD                                                 |
| Psychopy GC demo                                                            |
===============================================================================

Converted successfully: 36371 events, 0 samples, 6 blocks.

This results in two files: one containing all the samples, and one containing all the recorded events.

[3]:
fname_samples="../data/test_samples.asc"
fname_events="../data/test_events.asc"

The samples file contains a large table with the timestamp, the x/y coordinates of the eye position, and the pupil area for both the left and the right eye. Here are the first few rows of this file:

[4]:
!head $fname_samples
3385900   817.3   345.2  1707.0   860.6   375.2  1738.0 .....
3385902   817.0   343.5  1706.0   860.7   375.9  1739.0 .....
3385904   816.7   341.6  1705.0   861.2   376.6  1739.0 .....
3385906   816.7   340.4  1706.0   861.7   376.8  1740.0 .....
3385908   816.7   340.2  1707.0   861.6   376.9  1742.0 .....
3385910   816.8   340.2  1708.0   861.1   377.1  1743.0 .....
3385912   816.9   340.9  1708.0   860.7   377.5  1744.0 .....
3385914   816.1   342.1  1710.0   861.1   378.7  1745.0 .....
3385916   815.2   343.2  1712.0   862.5   380.0  1746.0 .....
3385918   814.4   343.6  1713.0   863.9   380.7  1747.0 .....

We can easily read this file using pandas.read_table().

[5]:
df=pd.read_table(fname_samples, index_col=False,
                  names=["time", "left_x", "left_y", "left_p",
                         "right_x", "right_y", "right_p"])
df
[5]:
time left_x left_y left_p right_x right_y right_p
0 3385900 817.3 345.2 1707.0 860.6 375.2 1738.0
1 3385902 817.0 343.5 1706.0 860.7 375.9 1739.0
2 3385904 816.7 341.6 1705.0 861.2 376.6 1739.0
3 3385906 816.7 340.4 1706.0 861.7 376.8 1740.0
4 3385908 816.7 340.2 1707.0 861.6 376.9 1742.0
... ... ... ... ... ... ... ...
1245358 5923060 . . 0.0 . . 0.0
1245359 5923062 . . 0.0 . . 0.0
1245360 5923064 . . 0.0 . . 0.0
1245361 5923066 . . 0.0 . . 0.0
1245362 5923068 . . 0.0 . . 0.0

1245363 rows × 7 columns

We can already use this information to create our PupilData object. We simply pass in the pupil area of the right eye (column right_p) and the timestamp array from the samples file (note: we could just as easily have used the left eye, or the mean of both eyes; see the sketch after the output below):

[6]:
pp.PupilData(df.right_p, time=df.time, name="test")
> Filling in 5 gaps
[32.35   4.012  6.21   2.02   1.862] seconds
[6]:
PupilData(test, 135.5MiB):
 n                 : 1268585
 nmiss             : 212551
 perc_miss         : 16.75496714843704
 nevents           : 0
 nblinks           : 0
 ninterpolated     : 0
 blinks_per_min    : 0.0
 fs                : 500.0
 duration_minutes  : 42.28616666666667
 start_min         : 56.431666666666665
 end_min           : 98.7178
 baseline_estimated: False
 response_estimated: False
 History:
 *
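
As an illustration of the last point, here is a minimal sketch of the mean-of-both-eyes variant. It assumes that a pupil area of 0.0 marks a missing sample (as in the last rows of the table above); pupil_mean is our own variable name, and how PupilData expects missing samples to be coded should be checked for the pypillometry version you use:

# treat a pupil area of 0.0 as missing
left = df.left_p.replace(0.0, np.nan)
right = df.right_p.replace(0.0, np.nan)
# average both eyes; the result is NaN wherever either eye is missing
pupil_mean = (left + right) / 2
pp.PupilData(pupil_mean, time=df.time, name="test_mean")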

Of course, this dataset is still missing the important information contained in the events file, which we will need for analysing trial-related pupil-diameter data. For that, we have to read the events file, which has a more complicated structure than the samples file:

[7]:
!head -20 $fname_events
** CONVERTED FROM ../data/test.edf using edfapi 3.1 MacOS X Jul 13 2010 on Wed May 27 16:45:20 2020
** DATE: Fri Feb 14 08:48:33 2020
** TYPE: EDF_FILE BINARY EVENT SAMPLE TAGGED
** VERSION: EYELINK II 1
** SOURCE: EYELINK CL
** EYELINK II CL v6.12 Feb  1 2018 (EyeLink Portable Duo)
** CAMERA: EyeLink USBCAM Version 1.01
** SERIAL NUMBER: CLU-DAC49
** CAMERA_CONFIG: DAC49200.SCD
** Psychopy GC demo
**

INPUT   2767568 0
MSG     2784000 !CAL
>>>>>>> CALIBRATION (HV9,P-CR) FOR LEFT: <<<<<<<<<
MSG     2784000 !CAL Calibration points:
MSG     2784000 !CAL -29.4, -23.5        -0,     -2
MSG     2784000 !CAL -29.3, -35.7        -0,  -1544
MSG     2784000 !CAL -32.9, -10.4        -0,   1559
MSG     2784000 !CAL -49.7, -23.0     -2835,     -2

After a header (lines starting with "**") containing meta-information, we get a sequence of "events", each of which can have a different format. We are interested in lines starting with "MSG", because those contain our experimental markers. Therefore, we read the events file and first remove all rows that do not start with "MSG":

[8]:
# read the whole file into variable `events` (list with one entry per line)
with open(fname_events) as f:
    events=f.readlines()

# keep only lines starting with "MSG"
events=[ev for ev in events if ev.startswith("MSG")]
events[0:10]
[8]:
['MSG\t2784000 !CAL \n',
 'MSG\t2784000 !CAL Calibration points:  \n',
 'MSG\t2784000 !CAL -29.4, -23.5        -0,     -2   \n',
 'MSG\t2784000 !CAL -29.3, -35.7        -0,  -1544   \n',
 'MSG\t2784000 !CAL -32.9, -10.4        -0,   1559   \n',
 'MSG\t2784000 !CAL -49.7, -23.0     -2835,     -2   \n',
 'MSG\t2784000 !CAL -10.8, -27.4      2835,     -2   \n',
 'MSG\t2784000 !CAL -48.3, -33.3     -2818,  -1544   \n',
 'MSG\t2784000 !CAL -11.0, -34.2      2818,  -1544   \n',
 'MSG\t2784000 !CAL -56.2, -9.2     -2852,   1559   \n']

During recording, we sent an experimental marker called experiment_start when the experiment began. Hence, we can remove all events before this marker.

[9]:
experiment_start_index=np.where(["experiment_start" in ev for ev in events])[0][0]
events=events[experiment_start_index+1:]
events[0:10]
[9]:
['MSG\t3387245 C_GW_1_1_UD_UD\n',
 'MSG\t3390421 F_GW_1_1_10_0\n',
 'MSG\t3392759 C_NW_1_2_UD_UD\n',
 'MSG\t3394293 R_NW_1_2_UD_UD\n',
 'MSG\t3395952 F_NW_1_2_-1_0\n',
 'MSG\t3397974 C_NA_1_3_UD_UD\n',
 'MSG\t3399892 R_NA_1_3_UD_UD\n',
 'MSG\t3400999 F_NA_1_3_-11_0\n',
 'MSG\t3403206 C_GA_1_4_UD_UD\n',
 'MSG\t3404640 R_GA_1_4_UD_UD\n']

The events are now in a format that we can convert into a pandas.DataFrame object for further processing.

[10]:
df_ev=pd.DataFrame([ev.split() for ev in events])
df_ev
[10]:
0 1 2 3 4 5 6 7 8
0 MSG 3387245 C_GW_1_1_UD_UD None None None None None None
1 MSG 3390421 F_GW_1_1_10_0 None None None None None None
2 MSG 3392759 C_NW_1_2_UD_UD None None None None None None
3 MSG 3394293 R_NW_1_2_UD_UD None None None None None None
4 MSG 3395952 F_NW_1_2_-1_0 None None None None None None
... ... ... ... ... ... ... ... ... ...
1065 MSG 5893078 V_UD_UD_16_UD_UD None None None None None None
1066 MSG 5899076 V_UD_UD_17_UD_UD None None None None None None
1067 MSG 5905073 V_UD_UD_18_UD_UD None None None None None None
1068 MSG 5911072 V_UD_UD_19_UD_UD None None None None None None
1069 MSG 5917071 V_UD_UD_20_UD_UD None None None None None None

1070 rows × 9 columns

In this table, the second column contains the timestamp (on the same clock as the timestamps in the samples file), and the third column contains our custom markers (the format, like "C_GW_1_1_UD_UD", is specific to our experimental design). There are several more columns that seem to contain no information in our case. Let's check what those columns are for by printing the rows of the data frame in which these columns are not None:

[11]:
# show rows in which column 4 contains something (is not None)
df_ev[np.array(df_ev[4])!=None].head()
[11]:
0 1 2 3 4 5 6 7 8
209 MSG 3900393 RECCFG CR 500 2 1 LR None
211 MSG 3900393 GAZE_COORDS 0.00 0.00 1919.00 1079.00 None None
212 MSG 3900393 THRESHOLDS L 56 231 R 66 239
213 MSG 3900393 ELCL_WINDOW_SIZES 176 188 0 0 None None
215 MSG 3900393 ELCL_PROC CENTROID (3) None None None None

Apparently, there are additional eye-tracker-specific messages in our file (in this case due to drift-checks during the experiment). We can safely drop those from our set of interesting events by removing all rows in which column 4 is not None, and then dropping the remaining uninteresting columns.

[12]:
# keep rows where column 4 is None (our own markers), retaining only
# column 1 (timestamp) and column 2 (marker)
df_ev=df_ev[np.array(df_ev[4])==None][[1,2]]
df_ev.columns=["time", "event"]
df_ev
[12]:
time event
0 3387245 C_GW_1_1_UD_UD
1 3390421 F_GW_1_1_10_0
2 3392759 C_NW_1_2_UD_UD
3 3394293 R_NW_1_2_UD_UD
4 3395952 F_NW_1_2_-1_0
... ... ...
1065 5893078 V_UD_UD_16_UD_UD
1066 5899076 V_UD_UD_17_UD_UD
1067 5905073 V_UD_UD_18_UD_UD
1068 5911072 V_UD_UD_19_UD_UD
1069 5917071 V_UD_UD_20_UD_UD

1035 rows × 2 columns
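
As an aside, because the labels are underscore-separated, they could be split into separate columns for later trial-wise processing. Here is a minimal sketch, assuming all labels have six fields as in this experiment; the column names are hypothetical and need to be adapted to your own design:

# split e.g. "C_GW_1_1_UD_UD" into its six underscore-separated fields
fields = df_ev.event.str.split("_", expand=True)
fields.columns = ["marker", "cond", "block", "trial", "info1", "info2"]
df_ev_split = pd.concat([df_ev, fields], axis=1)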

Finally, we can pass those event markers into our PupilData object.

[13]:
d=pp.PupilData(df.right_p, time=df.time, event_onsets=df_ev.time, event_labels=df_ev.event, name="test")
d
> Filling in 5 gaps
[32.35   4.012  6.21   2.02   1.862] seconds
[13]:
PupilData(test, 135.5MiB):
 n                 : 1268585
 nmiss             : 212551
 perc_miss         : 16.75496714843704
 nevents           : 1035
 nblinks           : 0
 ninterpolated     : 0
 blinks_per_min    : 0.0
 fs                : 500.0
 duration_minutes  : 42.28616666666667
 start_min         : 56.431666666666665
 end_min           : 98.7178
 baseline_estimated: False
 response_estimated: False
 History:
 *

The summary of the dataset shows that the eye-tracker started recording at time = 56.4 minutes. We can reset the time index to start at 0 using the reset_time() function; here, we also detect blinks right away with blinks_detect():

[14]:
d=d.reset_time().blinks_detect()

Now we can store away this dataset in pypillometry format and use all of the pypillometry functions on it. Below, we extract the slice from minute 4 to minute 6, drop the stored original signal to reduce file size, and write the result to disk; we then plot one minute of the dataset:

[16]:
d.sub_slice(4, 6, units="min").drop_original().write_file("../data/test.pd")
[15]:
plt.figure(figsize=(15,5));
d.plot((4, 5), units="min")
[Figure: pupil-size trace of the dataset between minutes 4 and 5]

Generalize to multiple similar datasets

Now that we have found a way to create our PupilData structure from the raw .edf files, we can wrap the code from this notebook into an easily accessible function that creates a PupilData object for any .edf file with the same structure.

We simply create a function that takes the name of an EDF file as input and runs all of the code above, returning the final PupilData object. For convenience, we assume that the edf2asc utility has already been run, so that the .asc files are available (see above for details).
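
If the .asc files have not yet been created, the conversion step could also be scripted from Python. A rough sketch, assuming the edf2asc command-line binary is available on the PATH (the -y/-s/-e flags are the ones used above):

import subprocess

def convert_edf(edffile):
    basename = os.path.splitext(edffile)[0]
    # -s extracts samples, -e extracts events, -y overwrites existing output
    subprocess.run(["edf2asc", "-y", "-s", edffile, basename + "_samples.asc"], check=True)
    subprocess.run(["edf2asc", "-y", "-e", edffile, basename + "_events.asc"], check=True)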

[42]:
datapath="../data" ## this is where the datafiles are located

def read_dataset(edffile):
    basename=os.path.splitext(edffile)[0] ## remove .edf from filename
    fname_samples=os.path.join(datapath, basename+"_samples.asc")
    fname_events=os.path.join(datapath, basename+"_events.asc")

    print("> Attempt loading '%s' and '%s'"%(fname_samples, fname_events))
    ## read samples-file
    df=pd.read_table(fname_samples, index_col=False,
                  names=["time", "left_x", "left_y", "left_p",
                         "right_x", "right_y", "right_p"])

    ## read events-file
    # read the whole file into variable `events` (list with one entry per line)
    with open(fname_events) as f:
        events=f.readlines()

    # keep only lines starting with "MSG"
    events=[ev for ev in events if ev.startswith("MSG")]
    # remove events before experiment start
    experiment_start_index=np.where(["experiment_start" in ev for ev in events])[0][0]
    events=events[experiment_start_index+1:]

    # re-arrange as described above
    df_ev=pd.DataFrame([ev.split() for ev in events])
    df_ev=df_ev[np.array(df_ev[4])==None][[1,2]]
    df_ev.columns=["time", "event"]

    # create `PupilData`-object
    d=pp.PupilData(df.right_p, time=df.time, event_onsets=df_ev.time, event_labels=df_ev.event, name=edffile)
    return d

We can test this code by simply running the function on a filename located in datapath:

[43]:
read_dataset("test.edf")
> Attempt loading '../data/test_samples.asc' and '../data/test_events.asc'
> Filling in 5 gaps
[32.35   4.012  6.21   2.02   1.862] seconds
[43]:
PupilData(test.edf, 135.5MiB):
 n                 : 1268585
 nmiss             : 212551
 perc_miss         : 16.75496714843704
 nevents           : 1035
 nblinks           : 0
 ninterpolated     : 0
 blinks_per_min    : 0.0
 fs                : 500.0
 duration_minutes  : 42.28616666666667
 start_min         : 56.431666666666665
 end_min           : 98.7178
 baseline_estimated: False
 response_estimated: False
 History:
 *

Storing/Loading several datasets

It is now easy to read a set of datasets from the same experimental setup into a Python list with a simple loop, e.g.,

[46]:
files=["test.edf", "test2.edf", "test3.edf"]
datasets=[read_dataset(fname) for fname in files]
> Attempt loading '../data/test_samples.asc' and '../data/test_events.asc'
> Filling in 5 gaps
[32.35   4.012  6.21   2.02   1.862] seconds

After that, we may want to save the final PupilData objects as .pd files that can readily be loaded back. Here, we loop through the list of datasets and store each of them in a separate file, using the name attribute of the object as the filename.

[47]:
for ds in datasets:
    fname=os.path.join(datapath, ds.name+".pd")
    ds.write_file(fname)

These datasets can be read back using the PupilData.from_file() method:

[58]:
# all filenames in `datapath` that end with `.pd`
pd_files=[fname for fname in os.listdir(datapath) if fname.endswith(".pd")]
datasets=[]
for fname in pd_files:
    fname=os.path.join(datapath, fname)
    d=pp.PupilData.from_file(fname)
    datasets.append(d)

It is also possible to store the whole list in a single file by using the pd_write_pickle() function:

[60]:
pp.pd_write_pickle(datasets, "full_dataset.pd")

which can be read back using the pd_read_pickle() function like so:

[61]:
datasets=pp.pd_read_pickle("full_dataset.pd")
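
As a quick sanity check after loading, we can loop over the list and print each dataset's name (the name attribute was set from the EDF filename in read_dataset() above):

for d in datasets:
    print(d.name)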