Abstract
In the development of science, accurate and reproducible documentation of the experimental process is crucial. Automatic recognition of the actions in experiments from videos would help experimenters by complementing the recording of experiments. Towards this goal, we propose FineBio, a new fine-grained video dataset of people performing biological experiments. The dataset consists of multi-view videos of 32 participants performing mock biological experiments, with a total duration of 14.5 hours. One experiment forms a hierarchical structure, where a protocol consists of several steps, each further decomposed into a set of atomic operations. Biological experiments are unique in that while they require strict adherence to the steps described in each protocol, there is freedom in the order of atomic operations. We provide hierarchical annotation on protocols, steps, atomic operations, object locations, and their manipulation states, providing new challenges for structured activity understanding and hand-object interaction recognition. To identify challenges in activity understanding in biological experiments, we introduce baseline models and results on four different tasks: (i) step segmentation, (ii) atomic operation detection, (iii) object detection, and (iv) manipulated/affected object detection. The dataset and code are available at https://github.com/aistairc/FineBio.
1 Introduction
In the development of science, accurate and reproducible documentation of the experimental process is crucial. Biological experiments are a perfect example of this. Reporting accurate materials, procedures, and results not only guarantees the correctness of the experiments but also becomes reliable evidence to promote new findings (Howe et al., 2008) and laboratory automation (Holland and Davies, 2020). For proper documentation of biological experiments, it is necessary to record what reagents are used, in what quantities, and by what operations, without omissions. However, errors or omissions in the records make it difficult to reproduce the results.
In this work, we aim to help researchers by automatically recognizing actions in experiments from video. Specifically, we study the problem of recognizing various steps and operations in biological experiments from videos by collecting a new dataset of a person performing molecular biological experiments. While a few works have collected footage of biological experiments (Nishimura et al., 2021; Grauman et al., 2022), their annotations are provided only at the step level (e.g., add 1 mL of PBS), and the amount of data is insufficient for quantitative evaluation by training models to recognize different steps and operations. We propose FineBio, a controlled yet challenging set of biological experiment videos with annotations at multiple spatio-temporal granularities: steps, atomic operations, object bounding boxes, and their manipulation states (Figure 1). We assume that one experiment follows a single protocol. A protocol consists of several steps, and each step gives strict instructions to follow. Each step is further decomposed into a set of atomic hand-related operations, such as “inserting a pipette into a cell culture plate”. Furthermore, such atomic operations arise from changes in hand-object relations (e.g., which object is manipulated by a hand or influenced by other objects), forming a hierarchical structure for each experiment. Instead of focusing on only one of the above, we aim to provide annotation at all levels to facilitate a holistic understanding of people performing experiments. We record 226 trials totaling 14.5 hours and collect annotations of 3.5K steps, 50K atomic operations, and 72K object bounding boxes. This enables us to train models to recognize activities and objects.
Following the latest works in video datasets (Sener et al., 2022; Grauman et al., 2024), we offer synchronized egocentric and multi-view third-person videos. Incorporating multiple third-person views complements the egocentric perspective by providing a holistic view of the workspace and addressing limitations such as occlusions or restricted fields of view. This setup can be leveraged to study cross-view learning, improved generalization across camera positions, and 3D understanding of hand-object interactions. It also provides insights into optimal camera configurations for real-world laboratory setups. While the baseline experiments focus on egocentric data, the third-person views offer additional opportunities for future research.
To identify challenges in activity understanding in biological experiments, we introduce baseline models and their results on four tasks: (i) step segmentation (Section 4.1), (ii) atomic operation detection (Section 4.2), (iii) object detection (Section 4.3), and (iv) manipulated/affected object detection (Section 4.4). The results show that the baseline models struggle to recognize and detect fine-grained details such as temporal boundaries and objects under interaction.
While the FineBio dataset employs mock setups to facilitate dense and diverse annotations, its focus on fundamental hand-object interactions and hierarchical task structures allows the trained models to serve as a testbed for analyzing experimenters’ hand-related tool manipulation.
Our contributions are (i) a multi-view video dataset of 32 people performing mock biological experiments, (ii) fine-grained action and object annotation on steps, atomic operations, object locations, and their manipulation states, and (iii) baseline models and experimental results on four different tasks, suggesting the need to model activity across multiple levels.
2 Related Works
2.1 Video datasets on biological experiments
While there are many image datasets that focus on parts of biological experiments (e.g., Edlund et al. (2021); Wei et al. (2021); Lin et al. (2021); Alam and Islam (2019)), few works collect videos of people performing biological experiments. BioVL (Nishimura et al., 2021) collects 16 egocentric videos of biological experiments with step-level caption annotation. However, the number of videos is insufficient to train a machine learning model, and the dataset is evaluated only in a zero-shot setting. Ego4D (Grauman et al., 2022) contains 25 hours of videos of biological experiments performed in a laboratory. However, the activities are uncontrolled and unstructured, and only coarse narrations are provided. Due to the lack of fine-grained annotation of object manipulation, the use of Micro QR Codes was studied (Nishimura et al., 2024), but in practice it is not desirable to require such preparation. FineBio collects videos of specific molecular biological experiments, along with fine-grained hierarchical annotation from the coarse step level to the fine-grained object manipulation level, which enables training models across different granularities and a holistic understanding of biological experiments.
Concurrent to our work, ProBio (Cui et al., 2023) proposes a conceptually similar video dataset of molecular biology experiments. ProBio provides a three-level hierarchy of a brief experiment level (corresponding to protocols), a practical experiment level (corresponding to steps), and a human-object interaction level (corresponding to atomic actions and object annotation), along with two benchmark tasks of transparent solution tracking and multimodal action recognition. While their multimodal action recognition task focuses only on classifying steps from cropped individual takes, our FineBio dataset focuses on temporally localizing steps and fine-grained atomic operations from untrimmed videos, which is a more practical and challenging task. We provide a larger or comparable amount of annotation in terms of annotated length (14.5 hours vs. 10.7 hours), number of steps (3.5K vs. 3.7K), and number of atomic operations (51K vs. 38K).
2.2 Video datasets on structural activities
Beyond recognizing individual actions, recognizing structured activities that consist of specific procedures is gaining more interest. An activity can be divided into several key steps (Alayrac et al., 2016; Bansal et al., 2022). Recognizing such key steps from videos has been studied in various fields such as cooking (Kuehne et al., 2016; Papadopoulos et al., 2022), sports (Shao et al., 2020; Xu et al., 2022), assembly (Ragusa et al., 2021; Sener et al., 2022), medical science (Kaku et al., 2022), and “how-to” instructional videos covering the aforementioned domains (Alayrac et al., 2016; Tang et al., 2019; Miech et al., 2019; Zhukov et al., 2019).
A key step can be further divided into more fine-grained sub-actions. Only a few datasets provide multi-level temporal action annotation (Zhao et al., 2019; Shao et al., 2020; Rai et al., 2021; Sener et al., 2022; Song et al., 2023). FineGym (Shao et al., 2020) is a dataset built on gymnastics videos with three levels of temporal annotation but does not incorporate interactions with objects. Assembly101 (Sener et al., 2022) offers a two-level hierarchy of coarse and fine-grained actions but differs in purpose in that the order of actions is uncontrolled and left to participants, and it lacks object-level annotation. Our FineBio dataset offers comprehensive multi-level annotation, from coarse yet strict step-level actions, through sub-second atomic operations, to frame-level object and manipulation state annotation in the practically important biology domain, providing a new challenge in recognizing actions and interacting objects across different temporal scales.
2.3 Hand-object relation modeling
Predicting manipulated (active) objects and their manipulation states (e.g., physical contact) during hand manipulation is crucial in understanding hand-related activities. Various tasks such as detecting objects-in-contact (Shan et al., 2020), contact prediction (Narasimhaswamy et al., 2020; Yagi et al., 2021), hand-object segmentation (Shan et al., 2021; Darkhalil et al., 2022; Zhang et al., 2022), active object detection (Ragusa et al., 2021; Fu et al., 2022), and action scene graph generation (Rodin et al., 2024) have been studied. However, most works focus on frame-level or short-term interactions, and the annotation is not associated with action understanding tasks, with a few exceptions (Darkhalil et al., 2022). The FineBio dataset provides object bounding box annotation with objects-in-contact, manipulated objects, and affected objects, aligned with the atomic operation annotation. A trained hand-object relation prediction model could be used to improve action detection models and vice versa, which offers a unique value to the dataset.
2.4 Comparison with existing datasets
Table 1 compares FineBio with existing datasets. The FineBio dataset stands out for its unique focus on biological experiments, a domain characterized by strict temporal relationships between steps. Unlike activities in domains such as cooking (EK-VISOR, EgoPER) or assembly (Assembly101), biological experiments demand precise adherence to ordered protocols while allowing flexibility in the execution of fine-grained atomic operations. This hierarchical structure of strict steps and flexible hand-object interactions is crucial for reproducibility, making the biology domain distinct from other domains.
FineBio provides labels at multiple granularity levels, including protocol and step annotations, as well as frame-level atomic operations and object bounding boxes. While other datasets like ProBio and EK-VISOR include HOI annotations, ProBio lacks a multi-view setup and fine-grained temporal labels. Instructional/procedural datasets (FineGym, BioVL, Ego4D goal-Step, EgoPER, EASG) lack hierarchical annotations or fail to provide detailed object and manipulation state information. FineBio uniquely captures detailed manipulation states and interactions across six synchronized camera views, enabling a deeper exploration of hand-object dynamics.
3 FineBio Dataset
3.1 Video Collection
3.2 Protocols
In this study, we define a protocol as a set of instructions describing an experiment with a specific objective to be achieved. FineBio consists of videos where a person performs mock molecular biological experiments following specific protocols, in front of a table (see Figure 1 left).
Specifically, we study four experiments: (i) lysis and recovery of cultured cells, (ii) DNA extraction with magnetic beads, (iii) polymerase chain reaction (PCR), and (iv) DNA extraction with spin columns, each taken from a popular procedure in the amplification of DNA in cells. From these experiments, we derive seven protocols consisting of multiple steps by changing the number of iterations of some steps (e.g., the number of ethanol washes).
In actual biological experiments, special equipment (e.g., a PCR machine or autoclave) is sometimes required, and there is waiting time until the chemical reaction finishes. Although these are crucial aspects of biological experiments, the waiting time is tedious and complicates the analysis, and special equipment may obscure the challenge of activity understanding with device-specific operations. To focus on analyzing experimenters’ hand-related tool manipulation, we modified some of the steps from the actual protocols as follows: (a) use primary tools to focus on hand operations, (b) use distilled water instead of real reagents, and (c) omit waiting times for centrifugation or molecule purification.
For example, we directed participants to take only a few seconds for vortexing, spinning down, and adsorption by magnetic beads, which may not be effective in real experiments. The end of the PCR experiment, which requires a PCR machine, was replaced by simply placing the 8-tube strips on a silver tube rack, since it does not include meaningful hand operations. We used distilled water instead of all reagents, so appearance changes due to chemical reactions could not be observed.
We emphasize that opting for mock experiments does not lose the essence of recognizing hand-related operations that appear in real biological experiments. Simplifying the protocols enabled us to collect denser and more diverse manipulation actions without imposing an excessive burden on the participants.
3.2.1 Camera configuration
We used GoPro HERO9 Black cameras for recording. We recorded the experiments with five fixed third-person cameras (\(4000\times 3000\), 30 fps, linear FOV) and one head-mounted first-person camera (\(3920\times 2160\), 30 fps, wide FOV). Each third-person camera was fixed to the platform and positioned at the left-back, left-front, right-back, right-front, and above the participant (see Figure 1 for an example). All cameras were temporally synchronized by a QR timecode displayed on a monitor (\(<30~\textrm{ms}\) error) and geometrically calibrated, using a checkerboard for the third-person cameras and four AR markers for the first-person camera, following Liu et al. (2021). Recording was conducted at two locations with a black working table. Each camera was set up to view the entire surface of the desk (120 cm \(\times \) 60 cm or 160 cm \(\times \) 70 cm) and the wearer. The top-view camera was positioned about 85–90 cm above the desk surface. Figure 4 left and center shows an example setup of each environment.
Fig. 4 Camera and marker setup. (Left) Camera setup of the first environment (P01–P09). Red circles denote the locations of the fixed third-person cameras; the top-view camera is out of frame. (Center) Camera setup of the second environment (P10–P32). (Right) Example of AR markers for extrinsic calibration of the first-person camera. Green circles denote the locations of the AR markers (Color figure online)
3.2.2 Calibration and synchronization
To obtain the extrinsic camera parameters of the third-person cameras, we resorted to standard chessboard-based calibration (Zhang, 2000). For the first-person camera, we followed a procedure similar to StereOBJ-1M (Liu et al., 2021). Specifically, we placed three or four AR markers on the table (see Figure 4 right for an example). We then computed the markers’ positions from the third-person cameras using Perspective-n-Point (PnP) pose computation and triangulation, and recovered the first-person camera’s pose from the obtained marker positions.
However, in some cases the camera pose could not be recovered due to rapid head motion or severe occlusion by arms or equipment. As a result, the percentage of frames in which at least one AR marker was detected and the camera pose was recovered was 90.4% on average.
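The following is a minimal sketch of this two-stage procedure using OpenCV; it assumes two calibrated third-person views with known \(3\times 4\) projection matrices and is a simplification of the actual pipeline.

```python
import cv2
import numpy as np

def triangulate_markers(P1, P2, pts1, pts2):
    """Triangulate AR-marker corners from two calibrated third-person views.
    P1, P2: 3x4 projection matrices; pts1, pts2: Nx2 corner detections."""
    pts_h = cv2.triangulatePoints(P1, P2,
                                  np.asarray(pts1, np.float64).T,
                                  np.asarray(pts2, np.float64).T)  # 4xN homogeneous
    return (pts_h[:3] / pts_h[3]).T  # Nx3 marker corners in the world frame

def egocentric_pose(marker_xyz, marker_uv, K, dist):
    """Recover the first-person camera pose from the 3D marker corners and
    their Nx2 detections in an egocentric frame. Returns None on failure,
    e.g., when the markers are occluded by arms or equipment."""
    ok, rvec, tvec = cv2.solvePnP(np.asarray(marker_xyz, np.float64),
                                  np.asarray(marker_uv, np.float64), K, dist)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec  # world-to-camera rotation and translation
```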
We implemented a QR code-based temporal synchronization scheme to synchronize videos across the third-person and first-person cameras. At the beginning of each trial, we display a QR code containing the recording time, current elapsed time, participant ID, protocol ID, take ID, etc., to efficiently manage the recordings (see Note 1).
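A minimal sketch of how such a displayed timecode could be used for alignment is shown below; the payload format and field names are hypothetical and do not reflect the actual encoding used for the dataset.

```python
import cv2

def read_timecode(frame):
    """Decode the synchronization QR code visible in a frame.
    The semicolon-separated key=value payload format is hypothetical."""
    payload, _, _ = cv2.QRCodeDetector().detectAndDecode(frame)
    if not payload:
        return None
    fields = dict(item.split("=", 1) for item in payload.split(";"))
    return float(fields["elapsed"])  # elapsed time shown on the monitor (s)

def frame_offset(elapsed_a, elapsed_b, fps=30):
    """Number of frames by which camera B must be shifted to align with
    camera A, given the elapsed time each camera read from the same code."""
    return round((elapsed_a - elapsed_b) * fps)
```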
3.2.3 Participants and recording
32 participants (16 men, 16 women) each performed 5–10 trials, resulting in 226 valid trials in total (see Note 2). Table 2 shows statistics on the number of recordings. The total recording duration was around 14.5 hours (87 hours of footage across all cameras). The average duration per trial was 3.9 minutes (see Figure 2 left for details). In contrast to Sener et al. (2022), we retook a trial if the participant made an unrecoverable mistake or performed steps in the wrong order.
3.2.4 Objects
To detect laboratory tools with a modest amount of annotation, we standardized the equipment used in the experiments. We selected 33 pieces of equipment to cover the objects used in the protocols (see the supplementary materials for the full list). Since a PCR machine was too large to install on the table, we substituted it with a tube rack with a distinctive appearance.
3.3 Annotation
We provide annotation at multiple granularities: steps, atomic operations, object locations, and their states.
3.3.1 Step
Steps refer to the ordered instructions described in a protocol. The experimenter must strictly follow the instructions and the order of the steps to ensure reproducibility. Therefore, the order of the steps is always the same within one protocol. We define 32 step categories for our dataset (see the supplementary materials for details). Some steps are shared across different protocols. We annotated the start/end time and the category of each step. In total, 3,541 steps were annotated. The average duration was 14.3 seconds, with variation across categories (see Figure 2 center for details).
3.3.2 Atomic operation
An atomic operation is defined as a minimal meaningful hand-centric action in an experiment. Although steps are strictly regulated, how the hands and equipment are manipulated to accomplish the instructions in a step may differ across participants, and such operations may occur simultaneously. Specifically, an atomic operation comprises a verb, a manipulated object, and an affected object, assigned to each hand. The verb is one of ten fine-grained motions such as put, take, and insert. A manipulated object is an object directly manipulated by the hand, typically involving contact with the hand. We also define an affected object as an object that is acted on through the manipulated object (e.g., take a micro tube from a tube rack, insert a tip into a pipette). Affected objects may not exist in some cases. For example, collecting a sample in a micro tube with a pipette may be described by five operations: {(take, pipette, none), (press, pipette, none), (insert, pipette, micro tube), (release, pipette, none), (detach, pipette, micro tube)}. When a conjunction of two or more objects is used (e.g., a pipette with a tip), we regard the manipulated object of the conjunction (the pipette in this case) as the primary object. We adopt Rubicon Boundaries (Moltisanti et al., 2017) to determine the start and end times of the operations. We collected 50,659 atomic operations in total. The average duration per operation was 0.91 seconds. As shown in Figure 2 right, the majority of atomic operations were instantaneous (\(<1\) second), making operation detection more challenging.
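To make the tuple format concrete, the pipetting example above can be written as follows; the field names and timestamps are illustrative and do not reproduce the released annotation schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AtomicOperation:
    """One hand-centric atomic operation (illustrative fields)."""
    start: float                    # Rubicon-boundary start time (s)
    end: float                      # Rubicon-boundary end time (s)
    hand: str                       # "left" or "right"
    verb: str                       # one of the 10 verbs (take, put, insert, ...)
    manipulated: str                # object directly handled by the hand
    affected: Optional[str] = None  # object acted on through it, if any

# Collecting a sample from a micro tube with a pipette (right hand),
# written as the five operations described in the text:
sample_collection = [
    AtomicOperation(10.2, 10.9, "right", "take", "pipette"),
    AtomicOperation(11.0, 11.4, "right", "press", "pipette"),
    AtomicOperation(11.5, 12.3, "right", "insert", "pipette", "micro tube"),
    AtomicOperation(12.4, 12.8, "right", "release", "pipette"),
    AtomicOperation(12.9, 13.5, "right", "detach", "pipette", "micro tube"),
]
```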
Figure 3 shows the occurrence distribution of verbs, manipulated objects, and affected objects, each split by hand side. Since most of the participants are right-handed, pipette-related actions are highly biased towards the right hand, while only a few objects (e.g., micro tube, cell culture plate, tip rack) are manipulated by the left hand.
3.3.3 Object location
Examples of object bounding box annotation and their manipulation states. Each panel shows an example of object annotation for protocols 1 (lysis and recovery), 3 (DNA extraction with magnetic beads), 5 (PCR), and 6 (DNA extraction with spin columns), from left to right. Each box color denotes an object category. Hand contact states and object manipulation states (contact, manipulated_left/right, effect_left/right) are shown next to the object name
To holistically understand activities, we also need to determine which objects are involved in an operation. Although several datasets (Ragusa et al., 2021; Damen et al., 2022) annotate only active objects currently involved in interactions due to cost constraints, this may harm the performance of an object detector because the remaining, inactive objects are treated as background. Besides active objects, affected objects that are influenced through active objects are also important for analyzing activities, but they were not considered in the above datasets. Given this, we decided to annotate all hands and objects appearing in a frame rather than only active objects. In addition, we annotate binary manipulation state information: the contact state, the manipulated object, and the affected object.
The contact state is assigned to each hand and is marked as true if an object is in contact with the hand. Following the definition of active objects as objects relevant to an action, we mark objects being manipulated or influenced within the interval of an atomic operation as manipulated and affected objects, respectively. Therefore, some objects are considered manipulated even though they are not in contact (see Figure 1 right). Tables 5 and 6 show the numbers of annotated objects and manipulated/affected objects, respectively.
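An illustrative per-frame record combining bounding boxes with these binary state flags might look as follows; the keys, coordinates, and category names are hypothetical and not the official annotation format.

```python
# Hypothetical JSON-like schema for one annotated frame: every visible object
# gets a box, plus the binary manipulation-state flags described above.
frame_annotation = {
    "frame_id": 12345,
    "boxes": [
        {"category": "right_hand", "bbox": [812, 640, 1020, 860], "contact": True},
        {"category": "pipette",    "bbox": [700, 300, 980, 700],
         "manipulated_right": True,  "effect_right": False},
        {"category": "micro_tube", "bbox": [660, 720, 700, 790],
         "manipulated_right": False, "effect_right": True},   # affected via the pipette
        {"category": "tube_rack",  "bbox": [600, 700, 900, 820]},  # inactive, still annotated
    ],
}
```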
3.3.4 Selection of frames to annotate
The appearance of an object may change significantly under interaction due to motion and occlusion. For a good object detector, it is important to collect diverse samples that reflect such difficult scenarios. To this end, we sampled 1,935 frames containing novel situations (e.g., rare viewpoints, objects in contact) from all the trials, instead of sampling uniformly. Specifically, we first sampled 1,346 first-person frames and then selected 118 of them for which we also annotated the corresponding 589 third-person frames (118 frames \(\times \) 5 views, with one frame omitted due to a recording issue) sharing the same timestamp. We annotated all the objects appearing in these frames, resulting in 71,548 bounding boxes in total. We provide visual examples in Figure 7.
3.3.5 Hierarchical structure
Figure 8 shows an example annotation of a specific step. Each row shows the (first-person) video frames, the step annotation, the atomic operation annotation (right hand/left hand), and the corresponding operation states, respectively. As shown in the figure, atomic operations appear intermittently, sometimes with temporal overlap. Although an experimenter can perform the operations at a convenient time, each operation must satisfy its prerequisites. For example, the tip rack must be opened before inserting a tip into a pipette. A simple operation such as pushing a pipette leads to different results depending on whether it is performed before or after inserting the pipette into a cell culture plate (collection or injection of the solution). To recognize such small visual differences, it is important not only to look at the holistic appearance but also to capture which objects are in contact, manipulated, or affected. FineBio thus provides new challenges in developing recognition models that fuse information across different granularities.
3.3.6 Data splits
Finally, we divided the data by participants, resulting in 22/5/5 participants for training, validation, and testing, without overlap. Table 3 summarizes the overall statistics of our FineBio dataset.
4 Experiments
To identify challenges in holistic activity understanding in biological experiments, we introduce baseline models and report the results on four tasks: (i) step segmentation (Section 4.1), (ii) atomic operation detection (Section 4.2), (iii) object detection (Section 4.3), and (iv) manipulated/affected object detection (Section 4.4). Each task corresponds to one annotation level, providing a set of challenging benchmarks ranging from recognizing activities lasting tens of seconds to frame-level object states. Although third-person videos and their object annotation are also available, we primarily use first-person videos throughout this section. All the code is available at https://github.com/aistairc/FineBio.
4.1 Step Segmentation
4.1.1 Definition
Given T frames in a video, the goal is to predict a step label \(s_i \in S\) for every frame \(i \in \{1, \dots, T\}\), where S denotes the set of 32 step categories plus a background label.
4.1.2 Evaluation Metrics
Following (Li et al., 2023), we report frame-wise accuracy, segmental edit score, and segmental F1 score with temporal overlapping thresholds at k% (F1@k, \(k\in \{10,25,50,75\}\)).
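For reference, a minimal sketch of the segmental F1@k metric (as popularized by MS-TCN-style evaluation) is shown below; frame-wise accuracy is simply the fraction of correctly labeled frames. This is a simplified reimplementation, not the official evaluation script.

```python
def to_segments(labels):
    """Turn per-frame labels into (label, start, end) segments (end exclusive)."""
    segs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segs.append((labels[start], start, t))
            start = t
    return segs

def segmental_f1(pred, gt, k=0.5, bg=None):
    """F1@k: a predicted segment is a true positive if it overlaps an
    unmatched ground-truth segment of the same class with IoU >= k.
    Segments labeled `bg` (background) are ignored."""
    p = [s for s in to_segments(pred) if s[0] != bg]
    g = [s for s in to_segments(gt) if s[0] != bg]
    used, tp = set(), 0
    for lbl, ps, pe in p:
        ious = [
            (min(pe, ge) - max(ps, gs)) / (max(pe, ge) - min(ps, gs))
            if lbl == glbl and j not in used else 0.0
            for j, (glbl, gs, ge) in enumerate(g)
        ]
        if ious and max(ious) >= k:
            tp += 1
            used.add(ious.index(max(ious)))
    fp, fn = len(p) - tp, len(g) - tp
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
```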
4.1.3 Baseline models
We use MS-TCN++ (Li et al., 2023) and ASFormer (Yi et al., 2021) as baselines for this task. We extract features from video frames using I3D (Carreira and Zisserman, 2017) pretrained on Kinetics and feed these features to the action segmentation models. The feature extractor is kept fixed, and the segmentation models are trained from scratch on the training set.
For feature extraction, we resize the longer side of each frame to 640 pixels while maintaining the aspect ratio. We compute optical flow with RAFT (Teed and Deng, 2020) pretrained on the Sintel dataset (Butler et al., 2012) and extract features with a two-stream I3D model (Carreira and Zisserman, 2017) pretrained on the Kinetics dataset. A clip of 21 consecutive frames centered on each frame is fed to the model to produce that frame’s feature vector.
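A simplified sketch of this per-frame sliding-window feature extraction is given below; it assumes a callable `i3d` model that maps a (1, C, T, H, W) clip to a single feature vector, which is an assumption about the interface rather than the actual extraction code.

```python
import torch

def frame_features(i3d, frames, win=21):
    """Per-frame features from a sliding window of `win` consecutive frames
    centered on each frame. `frames` is a (T, C, H, W) tensor; boundary
    frames are handled by edge padding."""
    pad = win // 2
    padded = torch.cat([frames[:1].repeat(pad, 1, 1, 1), frames,
                        frames[-1:].repeat(pad, 1, 1, 1)])
    feats = []
    with torch.no_grad():
        for t in range(len(frames)):
            clip = padded[t:t + win].permute(1, 0, 2, 3).unsqueeze(0)  # (1, C, win, H, W)
            feats.append(i3d(clip).squeeze(0))
    return torch.stack(feats)  # (T, D)
```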
For MS-TCN++ (Li et al., 2023), we set the number of layers to 12 in the prediction stage and 13 in the refinement stages, with four refinement stages. For ASFormer (Yi et al., 2021), we set the number of blocks per encoder/decoder to 11 and the number of decoders to five. We train both models with an initial learning rate of \(5\times 10^{-4}\) and a batch size of 1. We use the Adam (Kingma and Ba, 2014) optimizer with a cosine annealing scheduler for MS-TCN++ and the ReduceLROnPlateau scheduler for ASFormer. We train MS-TCN++ for 100 epochs and ASFormer for 120 epochs. We use the training, validation, and test splits for training, hyperparameter tuning, and evaluation, respectively.
We used a single NVIDIA Tesla V100 GPU for both models; training took 2.5 hours for MS-TCN++ (Li et al., 2023) and 2 days for ASFormer (Yi et al., 2021).
4.1.4 Results
Table 4 shows the quantitative results. Since the order of the steps is strictly defined by the protocols and only trials without ordering errors were included in the evaluation, the models show high accuracy compared to unconstrained benchmarks (Kuehne et al., 2016; Sener et al., 2022). In particular, the edit score and F1@{10,25} exceed 90%, which indicates that both models can predict the presence and category of the steps with high accuracy.
However, the scores drop significantly at a stricter threshold (F1@75). This suggests that the models struggle to infer the exact boundaries between steps at the operation level. We show major failure cases of MS-TCN++ in Figure 9. As shown inside the red rectangles, MS-TCN++ exhibited missing steps (top), false positives, and completely wrong boundaries (bottom). In the bottom example, “add wash buffer” and “add 70pct ethanol” were confused with each other since the operations within these steps are very similar (transferring different reagents from one tube to another). Resolving this issue will require inferring the correspondence between a step and actions at the fine-grained operation level (e.g., making sure that the reagent has actually been loaded and delivered to the sample tube exactly once).
4.2 Atomic Operation Detection
4.2.1 Definition
The objective of this task is to localize atomic operations and their constituents. Adopting a formulation similar to Heilbron et al. (2015) and Damen et al. (2022), given T frames in a video, the goal is to predict a set of N atomic operation instances \(\textbf{Y}=\{\textbf{y}_i\}_{i=1}^N\), where \(\textbf{y}_i=(t_s, t_e, v, m, a)\) is a tuple consisting of start and end times \((t_s, t_e)\), verb v, manipulated object m, and affected object a. Although atomic operations are assigned to each hand, for model simplicity we do not take the hand side into account here and count operations performed by both hands as duplicates.
4.2.2 Evaluation metrics
We evaluate with the standard mean average precision (mAP) at different temporal IoU (tIoU) thresholds. In addition to calculating scores separately for the verb and the manipulated/affected objects, we calculate the mAP of atomic operations, where a prediction is counted as correct only if the predicted classes (v, m, a) are all correct.
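The matching criterion can be summarized by the following sketch, where an instance is a tuple \((t_s, t_e, v, m, a)\); this is a paraphrase of the metric, not the evaluation code released with the dataset.

```python
def t_iou(a, b):
    """Temporal IoU of two intervals given as (t_s, t_e)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred, gt, thr):
    """For the atomic-operation mAP, a predicted instance (t_s, t_e, v, m, a)
    matches a ground-truth instance only if all three classes agree and the
    temporal IoU reaches the threshold."""
    return pred[2:] == gt[2:] and t_iou(pred[:2], gt[:2]) >= thr
```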
4.2.3 Baseline models
We choose ActionFormer (Zhang et al., 2022) as an off-the-shelf method. Input features are computed from video frames using pretrained I3D (Carreira and Zisserman, 2017), as in Section 4.1. In the model, three classification heads predict v, m, and a, respectively, followed by a single head that takes the concatenation of the output probability vectors of the three heads as input and classifies the combination (v, m, a) among 244 possible combinations. Each head uses the same architecture as Zhang et al. (2022).
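A minimal sketch of these heads is given below; the linear layers and class counts other than the 244 combinations are illustrative, whereas the actual heads follow the ActionFormer architecture.

```python
import torch
import torch.nn as nn

class OperationHeads(nn.Module):
    """Sketch of the per-element heads plus a combination head over the
    244 valid (verb, manipulated, affected) combinations."""
    def __init__(self, dim, n_verb=10, n_manip=34, n_affect=34, n_combo=244):
        super().__init__()
        self.verb = nn.Linear(dim, n_verb)      # verb logits
        self.manip = nn.Linear(dim, n_manip)    # manipulated-object logits
        self.affect = nn.Linear(dim, n_affect)  # affected-object logits
        self.combo = nn.Linear(n_verb + n_manip + n_affect, n_combo)

    def forward(self, feat):                    # feat: (B, T, dim)
        pv = self.verb(feat).softmax(-1)
        pm = self.manip(feat).softmax(-1)
        pa = self.affect(feat).softmax(-1)
        # The combination head consumes the concatenated probabilities.
        return self.combo(torch.cat([pv, pm, pa], dim=-1))  # (B, T, 244)
```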
Similar to step segmentation, we resize the longer side of each frame to 640 pixels while maintaining the aspect ratio, compute optical flow with RAFT (Teed and Deng, 2020) pretrained on the Sintel dataset (Butler et al., 2012), and extract features with a two-stream I3D model (Carreira and Zisserman, 2017) pretrained on the Kinetics dataset. Finally, we input clips of 16 consecutive frames at a stride of 4 to compute the feature vectors.
For ActionFormer (Zhang et al., 2022), we use two convolution layers for the projection and seven transformer blocks for the encoder, with 2\(\times \) downsampling in the last five blocks. The number of pyramid feature levels is six, and the regression range of the i-th level is [\(2^{i-1}\), \(2^{i+1}\)). We set a window size of nine for local self-attention. We train the model with an initial learning rate of \(10^{-4}\) and a batch size of 1 for 40 epochs, with a linear warmup of 5 epochs, using the Adam optimizer with a cosine annealing scheduler. We use the training, validation, and test splits for training, hyperparameter tuning, and evaluation, respectively.
It took about an hour to train ActionFormer (Zhang et al., 2022) using a single NVIDIA Tesla V100 GPU.
4.2.4 Results
The results are shown in Table 5. Each row shows mAP at different tIoU thresholds and their average. We report scores for the atomic operation, verb, manipulated object, and affected object, respectively. The results show that atomic operation detection is more challenging than step segmentation due to the unconstrained order of operations.
The lower scores on the object classes show the difficulty of distinguishing similar-looking objects (e.g., blue and yellow pipettes, tubes with and without spin columns), since the model does not explicitly detect objects.
4.3 Object Detection
4.3.1 Definition
The goal is to detect the locations and categories of all objects (left/right hand + 33 tools) in an image. Although detecting objects under manipulation is more important from a practical point of view, we first evaluate this standard setting since exhaustive object location annotation is available.
4.3.2 Evaluation metrics
We report COCO AP and Average Recall (AR) (Lin et al., 2014). We report the AR of manipulated/affected objects to evaluate how well objects under interaction can be detected.
4.3.3 Baseline models
We adopt two transformer-based models (Deformable DETR (Zhu et al., 2021) and DINO (Zhang et al., 2023)) as baseline models. We fine-tune the models pretrained on the MS COCO (Lin et al., 2014) dataset.
We used (i) two-stage Deformable DETR (Zhu et al., 2021) with a ResNet-50 backbone and iterative bounding box refinement, and (ii) four-scale DINO (Zhang et al., 2023) with a ResNet-50 backbone. We fine-tuned the models pretrained on COCO (Lin et al., 2014) with the training split and evaluated on the test split. We used the models implemented in MMDetection (Chen et al., 2019) and followed their default training and test settings. Training Deformable DETR (Zhu et al., 2021) and DINO (Zhang et al., 2023) took approximately 5.5 hours and 1 hour, respectively, on a single NVIDIA V100 GPU.
4.3.4 Results
Table 6 shows the quantitative results. The overall scores were high for both models because the same set of objects appeared in training and testing.
However, both models struggle with smaller objects, showing particularly unsatisfactory \(\textrm{AP}_{\textrm{S}}\) scores. Interestingly, we did not observe significant degradation in \(\textrm{AR}_{\textrm{manip}}\) and \(\textrm{AR}_{\textrm{affect}}\) compared to \(\textrm{AR}\), suggesting that producing bounding boxes for manipulated/affected objects under interaction is not in itself a hard task.
Figure 10 shows the average precision for each object class. While most objects can be detected at over 60 AP points, small objects such as tips and tubes are hard to detect for both models.
We also show qualitative results in Figure 11. The images on the left and right show the ground truth annotation and the model prediction, respectively. As indicated by the red arrows, smaller objects such as micro tubes and tips are often missed.
4.4 Manipulated/Affected Object Detection
Qualitative results on manipulated and affected object detection. Colored boxes denote the left hand, right hand, manipulated object, and affected object, respectively. M and A next to a label name indicate whether the object is a manipulated or an affected object. The leftmost image shows a successful case, while the middle two display failures in manipulated object detection. The rightmost shows a failure to link the correct affected object (Color figure online)
To understand atomic operations, which are primarily performed by hands, it is important to localize and recognize the objects manipulated by the hands. It is also important to localize the affected objects that are influenced through the manipulated object (Yu et al., 2023) in the case of tool-use actions, where a tool is used as a manipulated object to act on another object, i.e., the affected object. To this end, we extend the active object prediction scheme (Shan et al., 2020; Darkhalil et al., 2022) and evaluate performance on detecting manipulated and affected objects in video frames.
4.4.1 Definition
Given a single frame, the goal is to detect the hands and their manipulated objects as well as the affected objects. Unlike Shan et al. (2020) and Darkhalil et al. (2022), we also predict the object categories of the manipulated/affected objects. While this task looks similar to atomic operation detection, which also predicts the categories of the manipulated/affected objects, it emphasizes the spatial identification of the objects involved in interactions. Detecting manipulated/affected objects in still images is critical for biological experiments, as it enables precise documentation of object interactions independent of temporal context. This is vital in tasks such as pipetting or transferring reagents, which rely on accurate spatial understanding.
4.4.2 Evaluation metrics
We use COCO \(\textrm{AP}_{50}\) (Lin et al., 2014) following prior work. To evaluate performance by element, we calculate mAP for (1) the left and right hands, (2) manipulated objects of all object categories, and (3) affected objects of all object categories. We then report the mAP for both hands, for hands and manipulated objects both correct (H + Manipulated), and for all three entities (H + M + Affected).
4.4.3 Baselines, results, and challenges
We modify the Hand Object Detector (Shan et al., 2020) by adding branches. First, we classify the hand side (left or right) and the object categories at once in a single classification head. Our hand state classification head predicts whether the hand manipulates something or not. To detect one manipulated and one affected object, respectively, five branches are added: (a) binary manipulation state classification (manipulated or not), (b) manipulated-object offset prediction (offsets from the hand location, similar to Shan et al. (2020)), (c) binary affecting state classification (affecting or not), (d) binary affected state classification (affected or not), and (e) affected-object offset prediction (offsets from the manipulated object location). We determine manipulated objects by first thresholding the detected objects by the manipulation state prediction and then associating each hand with the object closest to the hand location plus the predicted manipulated-object offset vector. We then apply the same procedure to associate affected objects with manipulated objects. Unlike prior work, we use all the annotated bounding boxes instead of only the active object annotation. We also replace the base object detector with a 4-scale DINO (Zhang et al., 2023) with a ResNet-50 backbone.
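A simplified sketch of the hand-to-manipulated-object association step is shown below; the data structures and threshold are hypothetical, and the affected object is linked analogously from each manipulated object.

```python
import numpy as np

def centre(box):
    """Center of an (x1, y1, x2, y2) box."""
    return np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

def associate(hands, objects, thr=0.5):
    """Greedy association: keep objects whose predicted manipulation score
    exceeds `thr`, then pick, for each hand, the kept object closest to
    (hand center + predicted offset). Items in `hands` carry "box" and
    "offset"; items in `objects` carry "box" and "manip_score"."""
    kept = [o for o in objects if o["manip_score"] > thr]
    pairs = {}
    for i, h in enumerate(hands):
        target = centre(h["box"]) + np.asarray(h["offset"])
        if kept:
            j = int(np.argmin([np.linalg.norm(centre(o["box"]) - target)
                               for o in kept]))
            pairs[i] = kept[j]
    return pairs
```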
We modify the Faster R-CNN (Ren et al., 2015)-based Hand Object Detector (Shan et al., 2020) as described above. We use a 4-scale DINO (Zhang et al., 2023) with a ResNet-50 backbone as the object detector. We first train the object detector for 30 epochs with the default training settings of the original DINO codebase. We then train the auxiliary prediction heads for 5 epochs with the object detector frozen. We use the SGD optimizer with an initial learning rate of \(10^{-4}\) and a batch size of 2. We use the training and test splits for training and evaluation, respectively. Fine-tuning the hand-object detector took 15 minutes on a single A100 GPU.
We show the results in Table 7. Hands were detected almost perfectly. However, detecting the manipulated objects corresponding to the hands proved very challenging despite the high object detection performance reported in Section 4.3. Objects are densely arranged, and some objects (e.g., micro tubes) are placed inside other objects (e.g., a centrifuge), which makes it difficult to accurately identify the manipulated and affected objects. The scores for manipulated/affected object detection by the left hand were especially low because smaller objects such as micro tubes were often handled with the non-dominant left hand. Figure 12 shows qualitative results. While objects without occlusion were relatively easy, overlapping and smaller objects were hard to detect, possibly due to the naive offset prediction scheme and the lack of temporal context.
Incorporating multiple views into the detection pipeline could improve the performance of smaller objects by reducing occlusion and providing complementary perspectives. For instance, pooling features extracted from multiple views at a given timestamp could enhance spatial understanding and improve predictions for challenging scenarios.
4.5 Camera Viewpoints Analysis
In addition to the baseline experiments using egocentric videos, we examined the use of exocentric (third-person) view inputs. We trained separate models using each exocentric view (view 1: back left, view 2: back right, view 3: front left, view 4: front right, view 5: top, from the participant’s perspective). We also trained models using the average feature vectors of all exocentric views (Exo all) and of both egocentric and exocentric views (Ego + Exo).
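The “Exo all” and “Ego + Exo” settings correspond to simple late fusion by feature averaging, sketched below under the assumption of temporally synchronized per-view feature tensors; the view names are illustrative.

```python
import torch

def fuse_views(features, views):
    """Average the per-frame feature vectors of the selected, temporally
    synchronized views. `features` maps a view name to a (T, D) tensor."""
    return torch.stack([features[v] for v in views]).mean(dim=0)

# e.g. fuse_views(features, ["exo1", "exo2", "exo3", "exo4", "exo5"])        # Exo all
#      fuse_views(features, ["ego", "exo1", "exo2", "exo3", "exo4", "exo5"]) # Ego + Exo
```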
4.5.1 Step segmentation
Table 8 shows the results of step segmentation. In this task, we observed similar results across the different views, with the egocentric viewpoint performing slightly better among single views. Combining the features from all exocentric views achieved the best result.
4.5.2 Atomic operation detection
Table 9 presents the results of atomic operation detection (mAP@0.5). Compared to step segmentation, we observed a significant performance drop when the actions were viewed from the participant’s right side. This decline was attributed to verbs such as shake, press, and release, as well as objects like cell culture plate lids and pipettes, which are primarily handled with the participant’s right hand. These results suggest that occlusion of the right hand and the interacting objects from this perspective contributed to the performance drop.
5 Discussion
5.1 Practical values
For convenience and safety, we recorded “mock” experiments that were derived from real experiments but did not use real materials or reagents. Some operations were simplified to remove redundant waiting time. Therefore, a recognition model trained on this dataset cannot be directly applied to automating real biological experiments. We focused on the operational aspects of experiments, such as whether the steps were conducted correctly with respect to the protocols. Thus, we did not verify the experimental content, such as whether the amount of reagent was correct, whether the reaction occurred properly, or whether the goal was achieved.
The purpose of this dataset is not to train models for immediate application to real-world laboratory videos but rather to serve as a dataset for studying and improving models’ ability to understand fine-grained hand-object interactions in biological experiments. By collecting controlled, focused data, we prioritize quality over quantity, ensuring that every sample contributes to the model’s understanding of critical features such as hand-object relationships and procedural structure.
Generalization to new labs and equipment is indeed a broader challenge, but it extends beyond the scope of our current goals. Real-world biological experiment videos often include long waiting times, abrupt interruptions, and the use of special equipment, which limits their utility in understanding hand-related operations in detail. FineBio fills this gap by providing a structured and well-annotated dataset to study the essential elements of activity understanding in this domain. It enables researchers to isolate and address fundamental challenges before tackling the complexities of generalization across different locations. Collecting experiments using real materials would be of interest for future work.
5.2 Use of multi-view videos
While the baseline experiments use egocentric data, incorporating third-person views improved the results of step segmentation and atomic operation detection (Section 4.5). Combining egocentric and third-person views reduced ambiguities caused by occlusions or limited depth perception. More sophisticated approaches, such as learnable fusion and knowledge transfer between viewpoints, could further enhance performance. Furthermore, the multi-view setup could be used to extract 3D information on human poses and objects, opening the door to studying 3D interactions between hands and objects.
5.3 Towards structural understanding of biological experiments
Procedural understanding in biological experiments requires a framework that integrates high-level protocols with fine-grained actions and spatial tool localization. Existing methods such as task graphs (Ashutosh et al., 2023) or action scene graphs (Rodin et al., 2024) capture the temporal dynamics between high-level steps or instances and can be used as structured representations for procedural understanding. However, they lack a holistic procedural structure spanning from object interactions to longer temporal steps and do not address the specific requirements of biological experiments, where detailed hierarchical relationships are critical for accurate understanding.
FineBio introduces explicitly curated hierarchical annotations across protocols, steps, and atomic operations, enabling nuanced analysis of experimental workflows. Adapting such structured representations to FineBio would be of interest for future development. The baseline results and analysis provided in this work should be useful for further model development.
6 Conclusion
We have presented FineBio, a fine-grained dataset of biological experiments with hierarchical annotation. In addition to the extensive multi-level annotation, four benchmark tasks and their baseline evaluations have been presented. While FineBio simplifies some aspects of real biological experiments, its focus on fine-grained hierarchical annotations and structured activity recognition establishes a strong foundation for building reliable models. Future research can extend these models to real-world applications, with FineBio serving as a benchmark for methodological innovation in biological experiment monitoring and beyond. We hope FineBio encourages the community to develop new methods in the field of biological experiment understanding.
Notes
1. They employ multiple cameras installed on tables and instruments. Still, each location has a generic top-view camera and a camera focusing on the liquids, effectively having a single view per location.
2. Four trials were excluded from evaluation due to critical procedural errors found after the recording, while another six trials with minor procedural errors were kept. See the supplementary materials for details.
References
Alam, M. M., & Islam, M. T. (2019). Machine learning approach of automatic identification and counting of blood cells. Healthcare technology letters, 6(4), 103–108.
Alayrac, J. -B., Bojanowski, P., Agrawal, N., Sivic, J., Laptev, I., & Lacoste-Julien, S. (2016). Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 4575–4583).
Ashutosh, K., Ramakrishnan, S. K., Afouras, T., & Grauman, K. (2023). Video-mined task graphs for keystep recognition in instructional videos. Advances in Neural Information Processing Systems, 36.
Bansal, S., Arora, C., & Jawahar, C. (2022). My view is the best view: Procedure learning from egocentric videos. In Proceedings of the European Conference on Computer Vision, (pp. 657–675).
Butler, D. J., Wulff, J., Stanley, G. B., & Black, M. J. (2012). A naturalistic open source movie for optical flow evaluation. In Proceedings of the European Conference on Computer Vision, (pp. 611–625).
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Computer Vision and Pattern Recognition, (pp. 4724–4733).
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C. C., & Lin, D. (2019). MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155
Cui, J., Gong, Z., Jia, B., Huang, S., Zheng, Z., Ma, J., & Zhu, Y. (2023). Probio: A protocol-guided multimodal dataset for molecular biology lab. Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 36.
Damen, D., Doughty, H., Farinella, G. M., Furnari, A., Ma, J., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., & Wray, M. (2022). Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, 130, 33–55.
Darkhalil, A., Shan, D., Zhu, B., Ma, J., Kar, A., Higgins, R., Fidler, S., Fouhey, D., & Damen, D. (2022). Epic-kitchens visor benchmark: Video segmentations and object relations. Advances in Neural Information Processing Systems, 35, 13745–13758.
Edlund, C., Jackson, T. R., Khalid, N., Bevan, N., Dale, T., Dengel, A., Ahmed, S., Trygg, J., & Sjögren, R. (2021). Livecell—a large-scale dataset for label-free live cell segmentation. Nature methods, 18(9), 1038–1045.
Fu, Q., Liu, X., & Kitani, K. (2022). Sequential voting with relational box fields for active object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 2374–2383).
Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al. (2022). Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, (pp. 18995–19012).
Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V., Bansal, S., Boote, B., et al. (2024). Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, (pp. 19383–19400).
Heilbron, F. C., Escorcia, V., Ghanem, B., & Niebles, J. C. (2015). Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Computer Vision and Pattern Recognition, (pp. 961–970).
Holland, I., & Davies, J. A. (2020). Automation in the life science research laboratory. Frontiers in Bioengineering and Biotechnology, 8, Article 571777.
Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W., Hill, D. P., Kania, R., Schaeffer, M., St Pierre, S., et al. (2008). The future of biocuration. Nature, 455(7209), 47–50.
Kaku, A., Liu, K., Parnandi, A., Rajamohan, H. R., Venkataramanan, K., Venkatesan, A., Wirtanen, A., Pandit, N., Schambra, H., & Fernandez-Granda, C. (2022). Strokerehab: A benchmark dataset for sub-second action identification. Advances in Neural Information Processing Systems, 35, 1671–1684.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Kuehne, H., Gall, J., & Serre, T. (2016). An end-to-end generative framework for video segmentation and recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
Lee, S. -P., Lu, Z., Zhang, Z., Hoai, M., & Elhamifar, E. (2024). Error detection in egocentric procedural task videos. In: Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, pp. 18655–18666.
Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., & Gall, J. (2023). Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6), 6647–6658.
Lin, T. -Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, (pp. 740–755).
Lin, Z., Wei, D., Petkova, M. D., Wu, Y., Ahmed, Z., Zou, S., Wendt, N., Boulanger-Weill, J., Wang, X., Dhanyasi, N., et al. (2021). Nucmm dataset: 3d neuronal nuclei instance segmentation at sub-cubic millimeter scale. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, (pp. 164–174).
Liu, X., Iwase, S., & Kitani, K. M. (2021). Stereobj-1m: Large-scale stereo image dataset for 6d object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 10870–10879).
Miech, A., Zhukov, D., Alayrac, J. -B., Tapaswi, M., Laptev, I., & Sivic, J. (2019). Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 2630–2640).
Moltisanti, D., Wray, M., Mayol-Cuevas, W., & Damen, D. (2017). Trespassing the boundaries: Labeling temporal bounds for object interactions in egocentric video. In Proceedings of the IEEE International Conference on Computer Vision, (pp. 2886–2894).
Narasimhaswamy, S., Nguyen, T., & Nguyen, M. H. (2020). Detecting hands and recognizing physical contact in the wild. Advances in neural information processing systems, 33, 7841–7851.
Nishimura, T., Sakoda, K., Hashimoto, A., Ushiku, Y., Tanaka, N., Ono, F., Kameko, H., & Mori, S. (2021). Egocentric biochemical video-and-language dataset. In IEEE International Conference on Computer Vision Workshops, (pp. 3129–3133).
Nishimura, T., Yamamoto, K., Haneji, Y., Kajimura, K., Nishiwaki, C., Daikoku, E., Okuda, N., Ono, F., Kameko, H., & Mori, S. (2024). Biovl-qr: Egocentric biochemical video-and-language dataset using micro qr codes. arXiv preprint arXiv:2404.03161
Papadopoulos, D. P., Mora, E., Chepurko, N., Huang, K. W., Ofli, F., & Torralba, A. (2022). Learning program representations for food images and cooking recipes. In: Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, pp. 16559–16569.
Ragusa, F., Furnari, A., Livatino, S., & Farinella, G. M. (2021). The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (pp. 1569–1578).
Rai, N., Chen, H., Ji, J., Desai, R., Kozuka, K., Ishizaka, S., Adeli, E., & Niebles, J. C. (2021). Home action genome: Cooperative compositional action understanding. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, (pp. 11184–11193).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In: Proceedings of the Advances in neural information processing systems, vol. 28.
Rodin, I., Furnari, A., Min, K., Tripathi, S., & Farinella, G. M. (2024). Action scene graphs for long-form understanding of egocentric videos. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, (pp. 18622–18632).
Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R., & Yao, A. (2022). Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In: Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, pp. 21096–21106.
Shan, D., Geng, J., Shu, M., & Fouhey, D. F. (2020). Understanding human hands in contact at internet scale. In: Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, pp. 9866–9875.
Shan, D., Higgins, R., & Fouhey, D. (2021). Cohesiv: Contrastive object and hand embedding segmentation in video. Advances in Neural Information Processing Systems, 34, 5898–5909.
Shao, D., Zhao, Y., Dai, B., & Lin, D. (2020). Finegym: A hierarchical video dataset for fine-grained action understanding. In: Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, pp. 2616–2625.
Song, Y., Byrne, E., Nagarajan, T., Wang, H., Martin, M., & Torresani, L. (2023). Ego4d goal-step: Toward hierarchical understanding of procedural activities. Advances in Neural Information Processing Systems, 36.
Tang, Y., Ding, D., Rao, Y., Zheng, Y., Zhang, D., Zhao, L., Lu, J., & Zhou, J. (2019). Coin: A large-scale dataset for comprehensive instructional video analysis. In: Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition.
Teed, Z., & Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. In: Proceedings of the European Conference on Computer Vision, pp. 402–419.
Wei, J., Suriawinata, A., Ren, B., Liu, X., Lisovsky, M., Vaickus, L., Brown, C., Baker, M., Tomita, N., Torresani, L., et al. (2021). A petri dish for histopathology image analysis. In: Artificial Intelligence in Medicine, pp. 11–24.
Xu, J., Rao, Y., Yu, X., Chen, G., Zhou, J., & Lu, J. (2022). Finediving: A fine-grained dataset for procedure-aware action quality assessment. In: Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, pp. 2949–2958.
Yagi, T., Hasan, M. T., & Sato, Y. (2021). Hand-object contact prediction via motion-based pseudo-labeling and guided progressive label correction. In: Proceedings of the British Machine Vision Conference.
Yi, F., Wen, H., & Jiang, T. (2021). Asformer: Transformer for action segmentation. In: Proceedings of the British Machine Vision Conference.
Yu, Z., Huang, Y., Furuta, R., Yagi, T., Goutsu, Y., & Sato, Y. (2023). Fine-grained affordance annotation for egocentric hand-object interaction videos. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2154–2162.
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., & Shum, H. -Y. (2023). Dino: Detr with improved denoising anchor boxes for end-to-end object detection. In: Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=3mRwyG5one
Zhang, C. -L., Wu, J., & Li, Y. (2022). Actionformer: Localizing moments of actions with transformers. In Proceedings of the European Conference on Computer Vision, vol. 13664, (pp. 492–510).
Zhang, L., Zhou, S., Stent, S., & Shi, J. (2022). Fine-grained egocentric hand-object segmentation: Dataset, model, and applications. In Proceedings of the European Conference on Computer Vision, (pp. 127–145).
Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22(11), 1330–1334.
Zhao, H., Torralba, A., Torresani, L., & Yan, Z. (2019). Hacs: Human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, (pp. 8668–8678).
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable detr: Deformable transformers for end-to-end object detection. In: Proceedings of the International Conference on Learning Representations.
Zhukov, D., Alayrac, J. -B., Cinbis, R. G., Fouhey, D., Laptev, I., & Sivic, J. (2019). Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition.
Funding
Open Access funding provided by The University of Tokyo.
Additional information
Communicated by Stratis Gavves.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.