parlai.core.build_data

Utilities for downloading and building data.

These can be replaced if your particular file system does not support them.

class parlai.core.build_data.DownloadableFile(url, file_name, hashcode, zipped=True, from_google=False)[source]

Bases: object

A class used to abstract any file that has to be downloaded online.

Any task that needs to download a file must define a list named RESOURCES whose elements are objects of this class.

This class provides the following functionality:

  • Download a file from a URL / Google Drive

  • Untar the file if zipped

  • Checksum for the downloaded file

  • Send HEAD request to validate URL or Google Drive link

An object of this class needs to be created with:

  • url <string> : URL or Google Drive id to download from

  • file_name <string> : Name under which the downloaded file is saved

  • hashcode <string> : SHA256 hashcode of the downloaded file

  • zipped <boolean> : False if the file is not compressed (default True)

  • from_google <boolean> : True if the file is from Google Drive (default False)

__init__(url, file_name, hashcode, zipped=True, from_google=False)[source]

Create a DownloadableFile. The arguments are described in the class documentation above.

checksum(dpath)[source]

Check the SHA256 checksum of the given file against hashcode.

Parameters

dpath – path to the downloaded file.
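
As a rough illustration of what a SHA256 check on a downloaded file involves, the sketch below uses only the standard library. The names sha256_of_file and verify, the block-wise reading, and the choice to raise AssertionError on a mismatch are assumptions made for this example, not a description of ParlAI's own code.

```python
import hashlib


def sha256_of_file(file_path, block_size=65536):
    """Return the hex SHA256 digest of a file, reading it block by block."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(file_path, expected_hashcode):
    """Raise AssertionError if the digest differs from the expected hashcode."""
    actual = sha256_of_file(file_path)
    if actual != expected_hashcode:
        raise AssertionError(
            f"Checksum mismatch for {file_path}: expected "
            f"{expected_hashcode}, got {actual}"
        )
```

Reading in blocks keeps memory use flat for large archives, which is why the sketch avoids loading the whole file at once.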

check_header()[source]

Performs a HEAD request to check if the URL / Google Drive ID is live.

parlai.core.build_data.built(path, version_string=None)[source]

Check whether the ‘.built’ flag has been set for the task at the given path.

If a version_string is provided, it must match the version recorded in the ‘.built’ file; otherwise the data is regarded as not built.

parlai.core.build_data.mark_done(path, version_string=None)[source]

Mark this path as prebuilt.

Marks the path as done by adding a ‘.built’ file with the current timestamp plus a version description string if specified.

Parameters
  • path (str) – The file path to mark as built.

  • version_string (str) – The version of this dataset.
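
The flag-file idea behind built and mark_done can be sketched with the standard library alone. The function names below and the exact file layout (timestamp on the first line, version string on the second) are assumptions for illustration. The real file format may differ.

```python
import datetime
import os


def mark_done_sketch(path, version_string=None):
    """Write a '.built' file with a timestamp and an optional version line."""
    with open(os.path.join(path, ".built"), "w") as f:
        f.write(str(datetime.datetime.today()))
        if version_string:
            f.write("\n" + version_string)


def built_sketch(path, version_string=None):
    """Return True if '.built' exists and, when a version is given, matches."""
    flag = os.path.join(path, ".built")
    if not os.path.isfile(flag):
        return False
    if version_string is None:
        return True
    with open(flag) as f:
        lines = f.read().split("\n")
    # Assumed layout: line 0 is the timestamp, line 1 the version string.
    return len(lines) > 1 and lines[1] == version_string
```

A build script would typically call the check first, do the download and unpacking only when it returns False, and write the flag as the last step so that an interrupted build is not mistaken for a finished one.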

parlai.core.build_data.download(url, path, fname, redownload=False, num_retries=5)[source]

Download file using requests.

If redownload is False (the default), the file is not downloaded again if it is already present.

parlai.core.build_data.make_dir(path)[source]

Make the directory and any nonexistent parent directories (mkdir -p).

parlai.core.build_data.remove_dir(path)[source]

Remove the given directory, if it exists.
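
For readers replacing these helpers on an unusual file system, the behavior described for make_dir and remove_dir corresponds closely to two standard-library calls. The wrapper names below are hypothetical, and the sketch is one plausible way to get the same semantics rather than the module's actual code.

```python
import os
import shutil


def make_dir_sketch(path):
    """Create path and any missing parents, like `mkdir -p`."""
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(path, exist_ok=True)


def remove_dir_sketch(path):
    """Remove the directory tree at path; do nothing if it is absent."""
    # ignore_errors=True suppresses the error raised for a missing path.
    shutil.rmtree(path, ignore_errors=True)
```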

parlai.core.build_data.untar(path, fname, delete=True)[source]

Unpack the given archive file to the same directory.

Parameters
  • path (str) – The folder containing the archive. Will contain the contents.

  • fname (str) – The filename of the archive file.

  • delete (bool) – If true, the archive will be deleted after extraction.
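
The unpack-then-delete behavior described above can be approximated as follows. This is a minimal sketch assuming the archive format can be inferred from the file extension. untar_sketch is a hypothetical name, and the use of shutil.unpack_archive is a choice made for this example.

```python
import os
import shutil


def untar_sketch(path, fname, delete=True):
    """Unpack path/fname into path, optionally deleting the archive."""
    archive = os.path.join(path, fname)
    # The format (.tar.gz, .tar, .zip, ...) is inferred from the extension.
    shutil.unpack_archive(archive, path)
    if delete:
        os.remove(archive)
```

Deleting the archive after extraction halves the disk footprint of a dataset, at the cost of a fresh download if the extracted files are ever removed.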

parlai.core.build_data.ungzip(path, fname, deleteGZip=True)[source]

Unzips the given gzip compressed file to the same directory.

Parameters
  • path (str) – The folder containing the archive. Will contain the contents.

  • fname (str) – The filename of the archive file.

  • deleteGZip (bool) – If true, the compressed file will be deleted after extraction.
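
A gzip file holds a single compressed stream, so decompression amounts to copying that stream to a new file. The sketch below assumes the output name is the input name with its .gz suffix removed. Both that naming rule and the name ungzip_sketch are assumptions for illustration.

```python
import gzip
import os
import shutil


def ungzip_sketch(path, fname, delete_gzip=True):
    """Decompress path/fname (a .gz file) next to itself; return the new path."""
    if not fname.endswith(".gz"):
        raise ValueError(f"expected a .gz file name, got {fname!r}")
    gz_path = os.path.join(path, fname)
    out_path = gz_path[: -len(".gz")]
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Stream the data so large files are never held fully in memory.
        shutil.copyfileobj(src, dst)
    if delete_gzip:
        os.remove(gz_path)
    return out_path
```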

parlai.core.build_data.download_from_google_drive(gd_id, destination)[source]

Use the requests package to download a file from Google Drive.

parlai.core.build_data.download_models(opt, fnames, model_folder, version='v1.0', path='aws', use_model_type=False)[source]

Download models into the ParlAI model zoo from a URL.

Parameters
  • fnames – list of filenames to download

  • model_folder – models will be downloaded into models/model_folder/model_type

  • path – URL for downloading models; defaults to downloading from AWS

  • use_model_type – whether models are categorized by type in AWS

parlai.core.build_data.modelzoo_path(datapath, path)[source]

Map pretrained model filenames to their path on disk.

If path starts with ‘models:’, it is remapped to the model zoo path within the data directory (default is ParlAI/data/models). Models are downloaded from the model zoo if they are not yet present locally.
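
The prefix remapping can be pictured with the sketch below. It covers only the path translation and leaves out the download step. The function name and the assumption that the remainder of the string is joined directly under a models directory inside datapath are illustrative, not confirmed details of the implementation.

```python
import os


def modelzoo_path_sketch(datapath, path):
    """Translate a 'models:'-prefixed name into a path under datapath/models."""
    prefix = "models:"
    if path is None or not path.startswith(prefix):
        # Ordinary paths are returned unchanged.
        return path
    # Assumed layout: everything after the prefix is relative to the zoo root.
    return os.path.join(datapath, "models", path[len(prefix):])
```

Under this assumed layout, a name such as models:some_task/some_model (hypothetical) would resolve to a file inside the data directory, while a plain path such as /tmp/my_model passes through untouched.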

parlai.core.build_data.download_multiprocess(urls, path, num_processes=32, chunk_size=100, dest_filenames=None, error_path=None)[source]

Download items in parallel (e.g. for an image + dialogue task).

WARNING: may have issues with OS X.

Parameters
  • urls – Array of urls to download

  • path – directory to save items in

  • num_processes – number of processes to use

  • chunk_size – chunk size to use

  • dest_filenames – optional array of filenames, the same length as urls. Images will be saved as path + dest_filename

  • error_path – where to save error logs

Returns

Array of tuples of (destination filename, HTTP status code, error message if any). Note that upon failure, the file may not actually be created.
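
To make the role of chunk_size concrete, the sketch below shows one way a list of URLs and destination filenames could be split into batches before being handed to worker processes. It shows the batching step only. The name chunk_jobs and the pairing of each URL with its filename are assumptions for this example, and no downloading is performed.

```python
def chunk_jobs(urls, dest_filenames, chunk_size=100):
    """Pair each URL with its filename and split the pairs into batches."""
    if len(urls) != len(dest_filenames):
        raise ValueError("urls and dest_filenames must have the same length")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    jobs = list(zip(urls, dest_filenames))
    # Each batch holds at most chunk_size (url, filename) pairs.
    return [jobs[i:i + chunk_size] for i in range(0, len(jobs), chunk_size)]
```

Batching amortizes the per-task overhead of a process pool: each worker receives a group of downloads instead of a single URL, which matters when there are many small files.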