How to use crowsetta with your own annotation format#

This section shows you how to use crowsetta for working with your own annotation format for vocalizations (or some other format not currently built into the library).

Steps to using crowsetta with your own annotation format#

Below we’ll walk through a case study for using crowsetta with your annotation format. Here’s an outline of the steps we’ll go through:

  1. get your annotations into some variables in Python (maybe you already wrote code to do this)

  2. write a class that represents your format

  3. register your new format by calling crowsetta.register_format

  4. use your format with the crowsetta.Transcriber class to work with your annotations

If writing a class to represent your format sounds difficult, don’t worry. We show you how and crowsetta gives you tools that make it easy. If you’ve written some Python code to work with your annotations before then you’re probably already halfway done.

In this walkthrough we show you how to write a sequence-like format. By sequence-like, we mean a format that can be represented as a sequence of segments, where each segment has a start time, stop time, and label. Many common annotation formats for animal vocalizations are sequence-like.
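To make "sequence-like" concrete, here is a minimal sketch using only the standard library. The `ToySegment` class and the rounded times are made up for illustration; this is not crowsetta's own `Segment` class, just the same idea in miniature.

```python
from dataclasses import dataclass


# Hypothetical stand-in for illustration; not crowsetta's own Segment class.
@dataclass(frozen=True)
class ToySegment:
    label: str
    onset_s: float   # start time, in seconds
    offset_s: float  # stop time, in seconds


# A sequence-like annotation is an ordered collection of such segments.
toy_sequence = [
    ToySegment(label="1", onset_s=0.003, offset_s=0.142),
    ToySegment(label="1", onset_s=0.279, offset_s=0.505),
    ToySegment(label="5", onset_s=0.556, offset_s=0.596),
]

labels = [seg.label for seg in toy_sequence]
durations = [seg.offset_s - seg.onset_s for seg in toy_sequence]
print(labels)
print(durations)
```

Any format whose annotations can be reduced to this shape (label, start, stop, in order) is a candidate for the sequence-like approach shown in this walkthrough.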

Note

Sequence-like formats are one of the two types in crowsetta. The other is bounding-box formats, which annotate sound events with boxes, as the name implies. The steps for working with your own bounding box-like format would be exactly the same, except below where we talk about crowsetta.Sequences, you should replace that with crowsetta.Bboxes.

Case Study: the BatLAB format#

Let’s say you work in the Schumacher lab, studying bat vocalizations. The lab research specialist, Alfred, has spent years writing an application in Labview to capture bat calls, called SoNAR (“Sound and Neural Activity Recorder”). Alfred has also written a GUI in MATLAB called BatLAB that lets you interactively annotate audio files containing the bats’ calls, and saves the annotations in .mat (MATLAB data) files.

You’ve started to work with Python to analyze your data, because you like the data science and machine learning libraries. However, you find yourself writing the same code over and over again to unpack the annotations from the .mat files made by BatLAB. Every time you use the code for a new analysis, you have to modify it slightly. The code has some weird, hard-to-read lines to deal with the complicated MATLAB structs created by BatLAB and how they load into Python. The code also has several repetitive steps to deal with the idiosyncrasies of how SoNAR and BatLAB save data: unit conversion, data types, et cetera. You can’t change BatLAB or SoNAR though, because that’s Alfred’s job, and everyone else’s code that was written ten years ago (and still works!) expects those idiosyncrasies.

You know that it’s a good idea to turn the code you wrote into a function (because you took part in a Software Carpentry workshop and then you read this paper). You figure out which bits of the code will be common to all your projects and you make that into a function, called parse_batlab_mat. At first you just copy and paste it into all your projects. Then you decide you also want to save everyone else in your lab the effort of writing the same code, so you put the script on your lab’s GitHub page. This is a step in the right direction, but parse_batlab_mat gives you back a Python list of dicts, and you end up typing a lot of things like:

labels = annot_list[0]['seg_types']
onsets = annot_list[0]['seg_start_times']
offsets = annot_list[0]['seg_end_times']

Typing all those very similar ['keys'] in particular gets kind of annoying and makes you wonder if you should spend your vacation learning how to use one of those hacker text editors like vim.

But before you can worry about that, you get back reviews of your paper in PLOS Comp. Bio. called “Pidgeon Bat: Emergence of Dialects in Colonies of Multiple Bat Species”. Reviewer #3 doesn’t buy your conclusions (and you are pretty sure from the way they write that it is Oswald Cobblepot, professor emeritus of ethology at Metropolitan University of Fruitville, Florida, and author of the seminal review from 1982, “Bat Calls: A Completely Innate Behavior Encoded Genetically”). You want to share your data with the world, mainly to mollify reviewer #3. The problem is that this reviewer (if he is who you think he is) only knows how to write Fortran code and is definitely not going to figure out how to copy and use your function parse_batlab_mat so he can run your analysis scripts and reproduce your figures for himself.

What you really want is to share your data and write your code in a way that doesn’t depend on anyone knowing anything about BatLAB or SoNAR and how those programs save data and annotations. This is where crowsetta comes to your rescue.

Okay, now that we’ve set up some background for our case study, let’s go through the steps we outlined above.

1. get your annotation into some variables in Python#

Let’s look at this complicated data structure that we have our annotation in. For this tutorial you’ll need the file bat1_annotation.mat that you should be able to download from this link or by going to vocalpy/crowsetta.

The BatLAB GUI saves annotation into these annotation.mat files, with two variables in each mat file:

  • filenames: a vector where each element is the name of an audio file

  • annotations: a struct that has a record for each element in filenames, and that record is the annotation corresponding to the audio file with the same index in filenames
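As a rough picture of that layout, the sketch below mimics the two variables with plain Python lists. The file names and values are invented for illustration; the field names are the ones BatLAB uses in the real file.

```python
# Hypothetical stand-in data that mimics the layout of a BatLAB .mat file.
filenames = ["bat_0005.wav", "bat_0006.wav"]
annotations = [
    {"segFileStartTimes": [0.003, 0.279],
     "segFileEndTimes": [0.142, 0.505],
     "segType": [1, 1]},
    {"segFileStartTimes": [0.010],
     "segFileEndTimes": [0.120],
     "segType": [2]},
]

# zip() pairs each filename with the annotation at the same index.
paired = {name: annot for name, annot in zip(filenames, annotations)}
print(paired["bat_0006.wav"]["segType"])
```

The real data is more deeply nested than this once it is loaded into Python, but the pairing by index works the same way.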

The following snippet will let you load and inspect the data:

from scipy.io import loadmat
bat1_annotation = loadmat('bat1_annotation.mat')
print('Variables in .mat file:',
      [var for var in list(bat1_annotation.keys())
       if not var.startswith('__')],'\n'
     )
print(f"First 3 filenames:\n{bat1_annotation['filenames'][:,:3]}\n")
print(f"First annotation:\n{bat1_annotation['annotations'][:,0]}")
Variables in .mat file: ['filenames', 'annotations'] 

First 3 filenames:
[[array(['lbr3009_0005_2017_04_27_06_14_46.wav'], dtype='<U36')
  array(['lbr3009_0006_2017_04_27_06_14_57.wav'], dtype='<U36')
  array(['lbr3009_0007_2017_04_27_06_15_07.wav'], dtype='<U36')]]

First annotation:
[array([[(array([[0.00297619, 0.279125  , 0.55564729, 0.62654167, 0.68429167,
                 0.73929167, 0.79429167, 0.85020833, 0.906125  , 0.96479167,
                 1.02345833, 1.07754167, 1.128875  , 1.19579167, 1.25354167]]), array([[0.14150433, 0.504625  , 0.59629167, 0.64945833, 0.70445833,
                 0.75945833, 0.83004167, 0.884125  , 0.94095833, 1.013375  ,
                 1.06654167, 1.11156764, 1.17654167, 1.23154167, 1.29020833]]), array([[1, 1, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]], dtype=uint8), array([[48000]]))]],
       dtype=[('segFileStartTimes', 'O'), ('segFileEndTimes', 'O'), ('segType', 'O'), ('fs', 'O')])                                                                      ]

Below is the code you wrote to unpack the .mat files. Like we said above, the code has some weird, hard-to-read lines to deal with the way that the complicated MATLAB structs created by BatLAB load into Python, such as calling tolist() just to unpack an array, and some logic to make sure the labels get loaded correctly into a numpy array. And the code has several repetitive steps to deal with the idiosyncrasies of SoNAR and BatLAB, like converting the start and stop times of the calls from seconds back to sample numbers (the variables with the _Hz suffix, computed by multiplying by the sampling rate in Hertz) so you can find those times in the raw audio files.

# %load parsebat.py
import numpy as np
from scipy.io import loadmat


def parse_batlab_mat(mat_file):
    """parse batlab annotation.mat file"""
    mat = loadmat(mat_file, squeeze_me=True)
    annot_list = []
    # annotation structure loads as a Python dictionary with two keys
    # one maps to an array of filenames,
    # and the other to a Numpy array where each element is the annotation
    # corresponding to the filename at the same index.
    # We can iterate over both by using the zip() function.
    for filename, annotation in zip(mat['filenames'], mat['annotations']):
        # below, .tolist() does not actually create a list,
        # instead gets ndarray out of a zero-dimensional ndarray of dtype=object.
        # This is just weirdness that results from loading complicated data
        # structure in .mat file.
        seg_start_times = annotation['segFileStartTimes'].tolist()
        seg_end_times = annotation['segFileEndTimes'].tolist()
        seg_types = annotation['segType'].tolist()
        if type(seg_types) == int:
            # this happens when there's only one syllable in the file
            # with only one corresponding label
            seg_types = np.asarray([seg_types])  # so make it a one-element array
        elif type(seg_types) == np.ndarray:
            # this should happen whenever there's more than one label
            pass
        else:
            # something unexpected happened
            raise ValueError("Unable to load labels from {}, because "
                             "the segType parsed as type {} which is "
                             "not recognized.".format(filename,
                                                      type(seg_types)))
        samp_freq = annotation['fs'].tolist()
        seg_start_times_Hz = np.round(seg_start_times * samp_freq).astype(int)
        seg_end_times_Hz = np.round(seg_end_times * samp_freq).astype(int)
        annot_dict = {
            'audio_file': filename,
            'seg_types': seg_types,
            'seg_start_times': seg_start_times,
            'seg_end_times': seg_end_times,
            'seg_start_times_Hz': seg_start_times_Hz,
            'seg_end_times_Hz': seg_end_times_Hz,
        }
        annot_list.append(annot_dict)

    return annot_list

When it runs on a file, you end up with an annot_list where each item in the list is an annot_dict that contains the annotations for a file, like this:

annot_dict = {
    'seg_types': array([1, 1, 5, 2, ...]),
    'seg_start_times': array([0.00297619, 0.279125, 0.55564729, ...]),
    ... # end times, start and end times in sample numbers (the "_Hz" keys)
}
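The values stored under the `_Hz` keys are positions in samples, obtained by multiplying each time in seconds by the sampling rate. Here is that arithmetic on its own, in plain Python, using the first three onsets and the `fs` value printed from the file above:

```python
samp_freq = 48000  # sampling rate in Hz, the 'fs' field of the annotation
onsets_s = [0.00297619, 0.279125, 0.55564729]  # first three onsets, in seconds

# seconds * (samples / second) = samples, rounded to the nearest whole sample
onset_samples = [round(t * samp_freq) for t in onsets_s]
print(onset_samples)
```

parse_batlab_mat does the same thing with `np.round(...).astype(int)` so that it works on whole arrays at once.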

Again, as we said above, you turned your code into a function to make it easier to use across projects:

import numpy as np
from scipy.io import loadmat

def parse_batlab_mat(mat_file):
    """parse batlab annotation.mat file"""
    # code from above
    return annot_list

As we’ll see in a moment, all you need to do is take this code you already wrote, and instead of returning your list of dicts, you return a list of Sequences.

2. write a class that represents your format#

Since we are dealing with a sequence-like format, what we ultimately want to do is convert the format into a crowsetta.Sequence, so that it acts like all the other sequence-like formats.

So before we even worry about writing a class, let’s see how to make some Sequences.

Use one of the Sequence “factory functions” to conveniently turn annotations in your format into Sequences#

First, to get the Sequence, we’ll use a “factory function”. That means it’s a function built into the Sequence class that gives us back a new instance of a Sequence. One such factory function is Sequence.from_keyword. Here’s an example of using it with our parsebat code from above:

from parsebat import parse_batlab_mat
from crowsetta.sequence import Sequence

# you, using the function you already wrote
annot_list = parse_batlab_mat(mat_file='bat1_annotation.mat')

# you have annotation from one file in an "annot_dict"
annot_dict = annot_list[0]

a_sequence = Sequence.from_keyword(labels=annot_dict['seg_types'],
                                   onsets_s=annot_dict['seg_start_times'],
                                   offsets_s=annot_dict['seg_end_times'],
                                   onset_samples=annot_dict['seg_start_times_Hz'],
                                   offset_samples=annot_dict['seg_end_times_Hz'])
print("a_sequence:\n", a_sequence)
a_sequence:
 <Sequence with 15 segments>

Okay now that we saw how to make a crowsetta.Sequence from our annotations loaded into Python and numpy data types, let’s actually start writing our class.

Start to write the class#

Note

If the idea of writing a class is completely new to you, we suggest reading up on that first.

A very good place to start would be Think Python by Allen Downey, in particular the chapter on Classes and objects.

The first thing we want to do is just sketch out a class that will represent the annotations loaded into good old-fashioned Python data types. That way when we make a new instance of this class, it will contain the annotations we loaded from a single file. To make a class that holds our data, we use the attrs library. Then we later add a couple methods to this class that do things like turn the annotations into crowsetta.Sequences and crowsetta.Annotations.

To start writing the class, copy one of the existing classes in crowsetta, by looking at its code.
Here’s a stub of a Batlab class that we wrote by copying the 'yarden' format and changing a couple things – we’ll explain what we changed below.

# %load -r 1-10,14-48 batlab.py
import pathlib
from typing import ClassVar

import attr
import numpy as np
import scipy.io

from crowsetta import Sequence, Annotation
from crowsetta.typing import PathLike
import crowsetta


@crowsetta.interface.SeqLike.register
@attr.define
class Batlab:
    """Example custom annotation format"""
    name: ClassVar[str] = 'batlab'
    ext: ClassVar[str] = '.mat'

    annotations: np.ndarray = attr.field(eq=attr.cmp_using(eq=np.array_equal))
    audio_paths: np.ndarray = attr.field(eq=attr.cmp_using(eq=np.array_equal))
    annot_path: pathlib.Path = attr.field(converter=pathlib.Path)

    @classmethod
    def from_file(cls,
                  annot_path: PathLike):
        """load BatLAB annotations from .mat file

        Parameters
        ----------
        annot_path : str, pathlib.Path
            path to the .mat file saved by BatLAB
        """
        annot_path = pathlib.Path(annot_path)
        crowsetta.validation.validate_ext(annot_path, extension=cls.ext)

        annot_mat = scipy.io.loadmat(annot_path, squeeze_me=True)

        audio_paths = annot_mat['filenames']
        annotations = annot_mat['annotations']
        if len(audio_paths) != len(annotations):
            raise ValueError(
                f'list of filenames and list of annotations in {annot_path} do not have the same length'
            )

        return cls(annotations=annotations,
                   audio_paths=audio_paths,
                   annot_path=annot_path)

This might seem overwhelming at first, but we only changed a few things.

To tell crowsetta how to handle our format, we need to change exactly two of the class attributes: (1) the name and (2) the ext.

The name attribute is the shorthand string name that we use to refer to our format, for example when we call crowsetta.formats.by_name or we make a new Transcriber, passing in this name as the format argument (like so: scribe = crowsetta.Transcriber(format='name')).

The ext attribute tells crowsetta what a valid file extension is for this annotation format: is it a .mat file or a .csv file? Can it be either of two, like ('.txt', '.csv')? We then use this attribute in other places in the class, like when we write a from_file method, to validate the file name that gets passed into that method.
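To give a feel for what validating an extension involves, here is a minimal sketch. The `check_ext` helper is hypothetical and written only for this illustration; it is not the implementation of crowsetta.validation.validate_ext, which may behave differently.

```python
import pathlib


def check_ext(path, ext):
    """Hypothetical helper: raise ValueError unless ``path`` ends with an
    allowed extension. ``ext`` is a string or a tuple of strings."""
    allowed = (ext,) if isinstance(ext, str) else tuple(ext)
    suffix = pathlib.Path(path).suffix
    if suffix not in allowed:
        raise ValueError(
            f"invalid extension {suffix!r}, expected one of {allowed}"
        )
    return True


check_ext("bat1_annotation.mat", ".mat")    # passes
check_ext("notes.csv", (".txt", ".csv"))    # passes, either extension is fine

try:
    check_ext("bat1_annotation.wav", ".mat")
except ValueError as err:
    print(err)
```

Failing early like this gives a clearer error than letting a loader choke on a file it was never meant to read.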

Re-writing the from_file method is the other thing we changed. Notice that this from_file method is basically the few lines from our parse_batlab_mat function, where we unpack the filenames and the annotations from the .mat file. We don’t loop through them to put them in a Python dict though. Instead we assign them to the class’s attributes audio_paths and annotations. You don’t have to completely understand what’s going on here; basically we are writing our own “factory function” (like the one we used for Sequences above) that gives us a new instance of our Batlab class that will have the specific annotations from a file loaded into it. (Writing a factory method is consistent with advice from the attrs docs). To achieve this we use the @classmethod decorator and pass in cls as the first argument (by convention) that we can then call to create a new instance. (To learn more about classmethods see https://realpython.com/instance-class-and-static-methods-demystified/).
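If the classmethod-as-factory pattern is new to you, it may help to see it stripped down. The sketch below uses a made-up `ToyFormat` class and a small JSON file instead of a .mat file, so that it runs with the standard library alone; only the pattern carries over to the Batlab class.

```python
from dataclasses import dataclass
import json
import pathlib
import tempfile


@dataclass
class ToyFormat:
    """Hypothetical format class, for illustration only."""
    labels: list
    annot_path: pathlib.Path

    @classmethod
    def from_file(cls, annot_path):
        annot_path = pathlib.Path(annot_path)
        with annot_path.open() as fp:
            data = json.load(fp)
        # calling cls(...) builds a new instance of the class
        return cls(labels=data["labels"], annot_path=annot_path)


# write a tiny file first so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "toy.json"
    path.write_text(json.dumps({"labels": ["a", "b"]}))
    toy = ToyFormat.from_file(path)

print(toy.labels)
```

The caller never has to know how the file is parsed; they ask the class for an instance and get one back with the data already loaded.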

Now write a to_seq method that converts the annotations to crowsetta.Sequences#

Again, you pretty much already wrote this. Just take your parse_batlab_mat function from above and change a couple lines. First, you’re going to return a list of sequences instead of your annot_list from before. You probably want to make that explicit in your function.

# %load -r 51-96 batlab.py
def to_seq(self):
    """unpack BatLAB annotation into list of Sequence objects

    example of a function that unpacks annotation from
    a complicated data structure and returns the necessary
    data as a Sequence object

    Returns
    -------
    seqs : list
        of Sequence objects
    """
    seqs = []
    # self.audio_paths is an array of filenames,
    # and self.annotations is a Numpy array where each element is the
    # annotation corresponding to the filename at the same index.
    # We can iterate over both by using the zip() function.
    for filename, annotation in zip(self.audio_paths, self.annotations):
        # below, .tolist() does not actually create a list,
        # instead gets ndarray out of a zero-dimensional ndarray of dtype=object.
        # This is just weirdness that results from loading complicated data
        # structure in .mat file.
        onsets_s = annotation['segFileStartTimes'].tolist()
        offsets_s = annotation['segFileEndTimes'].tolist()
        labels = annotation['segType'].tolist()
        if isinstance(labels, int):
            # this happens when there's only one syllable in the file
            # with only one corresponding label
            labels = np.asarray([labels])  # so make it a one-element array
        elif isinstance(labels, np.ndarray):
            # this should happen whenever there's more than one label
            pass
        else:
            # something unexpected happened
            raise ValueError("Unable to load labels from {}, because "
                             "the segType parsed as type {} which is "
                             "not recognized.".format(filename,
                                                      type(labels)))
        samp_freq = annotation['fs'].tolist()

        seq = Sequence.from_keyword(labels=labels,
                                    onsets_s=onsets_s,
                                    offsets_s=offsets_s)
        seqs.append(seq)
    return seqs

Then at the end of your main loop, instead of making your annot_dict, you’ll make a new Sequence from each file using the from_keyword factory function, append the new Sequence to your list, seqs, and then finally return that list of Sequences.

# %load -r 92-95 batlab.py
seq = Sequence.from_keyword(labels=labels,
                            onsets_s=onsets_s,
                            offsets_s=offsets_s)
seqs.append(seq)

If this still feels too wordy and repetitive for you, you can put segFileStartTimes, segFileEndTimes, et al., into a Python dict with keys corresponding to the parameters for Sequence.from_keyword:

annot_dict = {
    'file': filename,
    'onsets_s': annotation['segFileStartTimes'].tolist(),
    'offsets_s': annotation['segFileEndTimes'].tolist(),
    'labels': seg_types
}

Note that you only have to specify the onsets and offsets of segments either in seconds or in sample numbers (but you can specify both if you want).

Then use another factory function, Sequence.from_dict, to create the Sequence:

seq_list.append(Sequence.from_dict(annot_dict))
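The reason the dict keys need to match the parameter names can be seen in a small sketch. Both functions below are hypothetical stand-ins written for this illustration, and we are not claiming this is how crowsetta implements Sequence.from_dict; it is just one common way a dict-based convenience function can sit on top of a keyword-based one.

```python
def from_keyword(labels, onsets_s=None, offsets_s=None):
    """Hypothetical stand-in with keyword parameters; not crowsetta's method."""
    return list(zip(labels, onsets_s, offsets_s))


def from_dict(seq_dict):
    # ** unpacks the dict so that each key is passed
    # as the keyword argument with the same name
    return from_keyword(**seq_dict)


annot = {
    "labels": ["1", "5"],
    "onsets_s": [0.003, 0.556],
    "offsets_s": [0.142, 0.596],
}
segments = from_dict(annot)
print(segments)
```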

Now you have a Batlab class with a to_seq method that takes annotation files and returns Sequences. You want to put this in a file that ends with .py, e.g., batlab.py (otherwise known as a Python module). To see the entire example, check out the batlab.py file (and compare it with parsebat.py).

3. Register your new format by calling crowsetta.register_format#

To make it so that crowsetta knows about your format, you call the crowsetta.register_format function and pass in the class you have written.

After you do this, you should see that the shorthand string name that you defined for the class appears in the list returned when you call crowsetta.formats.as_list().

import crowsetta
import batlab

crowsetta.register_format(batlab.Batlab)

formats_list = crowsetta.formats.as_list()
assert batlab.Batlab.name in formats_list  # no AssertionError, because 'batlab' is in the list
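Conceptually, registering a format means storing the class somewhere it can later be looked up by its shorthand name. The sketch below shows that idea with a plain dict. The `FORMATS` dict, the `register` function, and the `ToyBatlab` class are all hypothetical, and crowsetta's internals may well differ.

```python
# Hypothetical registry, for illustration; crowsetta's internals may differ.
FORMATS = {}


def register(format_class):
    """Store a class under its ``name`` attribute and hand the class back."""
    FORMATS[format_class.name] = format_class
    return format_class


class ToyBatlab:
    name = "toybatlab"
    ext = ".mat"


register(ToyBatlab)
print(sorted(FORMATS))

# look the class up again by its shorthand name
looked_up = FORMATS["toybatlab"]
```

Because `register` returns the class it was given, the same function can also be used with the `@` decorator syntax.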

Use crowsetta.register_format as a decorator#

Instead of calling crowsetta.register_format “manually”, you can use it as a decorator, so that your class gets registered automatically when you import the module.

To use crowsetta.register_format as a decorator, we write it with the @ symbol at the top of our class:

@crowsetta.formats.register_format
@crowsetta.interface.SeqLike.register
@attr.define
class Batlab:
    """Example custom annotation format"""
    name: ClassVar[str] = 'batlab'
    ext: ClassVar[str] = '.mat'

(If you’re unfamiliar with decorators, check out this primer: https://realpython.com/primer-on-python-decorators/)

This works because decorators are applied when the class definition is executed, which happens when the module is imported. Notice we also used other decorators: another from crowsetta that registers our class as sequence-like, and one from attrs that helps us easily define a class. You need to make sure that register_format is the outermost decorator, because decorators are applied “inside-out”: the one closest to the class runs first.
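You can check the order in which stacked decorators are applied with a few lines of plain Python. The decorators `inner` and `outer` below are made up for this demonstration and only record when they run.

```python
applied = []


def inner(cls):
    applied.append("inner")
    return cls


def outer(cls):
    applied.append("outer")
    return cls


@outer   # applied last, so it receives the class after `inner` has run
@inner   # closest to the class, so it is applied first
class Example:
    pass


print(applied)
```

This is why the decorator that registers the format goes on top: by the time it runs, the decorators underneath it have already finished building the class.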

Here’s just the top few lines from an updated version of our class where we apply crowsetta.register_format as a decorator.

# %load -r 1-22 batlab.py
import pathlib
from typing import ClassVar

import attr
import numpy as np
import scipy.io

from crowsetta import Sequence, Annotation
from crowsetta.typing import PathLike
import crowsetta


@crowsetta.formats.register_format
@crowsetta.interface.SeqLike.register
@attr.define
class Batlab:
    """Example custom annotation format"""
    name: ClassVar[str] = 'batlab'
    ext: ClassVar[str] = '.mat'

    annotations: np.ndarray = attr.field(eq=attr.cmp_using(eq=np.array_equal))
    audio_paths: np.ndarray = attr.field(eq=attr.cmp_using(eq=np.array_equal))

4. Use your format with the crowsetta.Transcriber class to work with your annotations#

If you have worked with Crowsetta already, or gone through the tutorial, you know that we can work with a Transcriber that does the work of making Sequences of Segments from annotation files for us. We create a new instance of a Transcriber by writing something like this:

scribe = crowsetta.Transcriber(format='name')

You will do the same thing here:

scribe = crowsetta.Transcriber(format='batlab')

Here’s what it looks like to do all of that in a few lines of code:

import crowsetta
import batlab  # gets registered automatically when we import, because of decorator

scribe = crowsetta.Transcriber(format='batlab')

seq_list = scribe.from_file('bat1_annotation.mat').to_seq()

And now, just like you do with the built-in formats, you get back a list of Sequences from your format:

print(f'First item in seq_list: {seq_list[0]}')
print(f'First segment in first sequence:\n{seq_list[0].segments[0]}')
First item in seq_list: <Sequence with 15 segments>
First segment in first sequence:
Segment(label='1', onset_s=0.0029761904761904934, offset_s=0.14150432900432905, onset_sample=None, offset_sample=None)

Summary#

Now you have seen in detail the process of working with your own annotation format in Crowsetta. Here’s a review of the steps, with some code snippets worked in to tie it all together:

  1. get your annotations into some variables in Python (maybe you already wrote code to do this)

  2. write a class that represents your format

  3. register your new format by calling crowsetta.register_format

  4. use your format with the crowsetta.Transcriber class to work with your annotations

Steps 1-3 will give you something like this in a file named something like myformat.py:

import pathlib
from typing import ClassVar

import attr

from crowsetta import Sequence, Annotation
from crowsetta.typing import PathLike
import crowsetta


@crowsetta.formats.register_format
@crowsetta.interface.SeqLike.register
@attr.define
class MyFormat:
    """Example custom annotation format"""
    name: ClassVar[str] = 'myformat'
    ext: ClassVar[str] = '.csv'

    ...
    
    @classmethod
    def from_file(cls, annot_path):
        ...
        return cls(annotations, annot_path)

    ...
    def to_seq(self):
        seq_list = []
        for annotation in self.annotations:
            # load annotation into some Python variables, e.g. a dictionary
            annot_dict = magic_annotation_unpacking_function(annotation)
            seq = Sequence.from_dict(annot_dict)
            seq_list.append(seq)
        return seq_list

    def to_annot(self):
        seqs = self.to_seq()
        annots = []
        for seq in seqs:
            annots.append(Annotation(seq=seq, annot_path=self.annot_path))
        return annots

Then, as in step 4, you will be able to make a Transcriber that knows to use this class when you tell it you want to turn your annotation files into Sequences or Annotations.

scribe = crowsetta.Transcriber(format='myformat')
myformat = scribe.from_file('my-annotations.csv')
seq_list = myformat.to_seq()