
The Benchmark Data for CMR



[Figure: cmr_formulation]


Download and Preprocess

Here we use the MRQA datasets as an example to show how datasets should be processed.

  1. Download the data files
    cd data/mrqa/
    bash download.sh
    
  2. Preprocess the datasets
    cd ~/CMR/ # go to the root folder of the CMR project (assumed here to be ~/CMR/)
    python data/data_formatter.py
    

After these two steps, you should see several mrqa_* folders under the data folder, each corresponding to a particular data cluster.
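
As a quick sanity check after the two steps above, the following minimal sketch lists the mrqa_* folders under the data folder. The helper name `list_cluster_dirs` is hypothetical and not part of the CMR code base; it only assumes the folder layout described here.

```python
from pathlib import Path


def list_cluster_dirs(data_dir="data"):
    """Return the sorted names of the mrqa_* sub-folders found under data_dir.

    Hypothetical helper (not part of CMR): it only checks that the folders
    exist and does not inspect their contents.
    """
    root = Path(data_dir)
    # Keep directories only; stray files that happen to match are ignored.
    return sorted(p.name for p in root.glob("mrqa_*") if p.is_dir())
```

Calling `list_cluster_dirs("data")` from the root folder of the CMR project should return one name per data cluster; an empty list suggests that one of the two steps did not complete.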

Generating OOD data streams

Generate upstream model predictions.

We first run the upstream model on examples from the other data clusters to collect its predictions.

mkdir -p upstream_resources/qa_upstream_preds/tmp/ # under the CMR folder
bash scripts/run_mrqa_infer.sh

The first part of this script tests the upstream model on the upstream training data; the second part tests it on the dev data of the other data clusters.

Sample data streams.

Now we generate the data streams and evaluation sets needed for our experiments. The default configuration used here is defined in the script invoked below.

mkdir -p experiments/eval_data/qa/
python cmr/benchmark_gen/sample_submission_streams.py --task_name QA

The generated data streams can be visualized by running cmr/benchmark_gen/visualize_streams.py.
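
The directory creation and sampling commands above can also be driven from Python, for example when scripting repeated runs. The sketch below is only an illustration: the function name `generate_qa_streams` and the `dry_run` switch are hypothetical, and it uses the current interpreter (`sys.executable`) in place of the bare `python` command. It passes only the `--task_name QA` argument shown above and must be run from the root folder of the CMR project.

```python
import subprocess
import sys
from pathlib import Path

# Paths and arguments copied from the shell commands in this section.
EVAL_DIR = Path("experiments") / "eval_data" / "qa"
SAMPLER = "cmr/benchmark_gen/sample_submission_streams.py"


def generate_qa_streams(dry_run=True):
    """Create the output folder and run the stream sampler for the QA task.

    Hypothetical wrapper (not part of CMR). With dry_run=True nothing is
    created or executed; the folder and command are only returned.
    """
    cmd = [sys.executable, SAMPLER, "--task_name", "QA"]
    if not dry_run:
        EVAL_DIR.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
        subprocess.run(cmd, check=True)  # raises if the script exits with an error
    return EVAL_DIR, cmd
```

Calling `generate_qa_streams(dry_run=False)` performs the same two steps as the shell commands; the default dry run is useful for checking the command before launching it.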

More configurations for generating OOD data streams

[Figure: ood_data]

TODO: describe in detail how to use sample_submission_streams.py to generate the OOD data streams shown in the picture above.