Transformations and Data Quality

Details the process of getting data ready to be submitted

Uploading data to the repository

Currently, users and curators make submissions to the Library Bill Tracking repository hosted on GitHub. Submissions must be contained in a folder that follows our file naming conventions and placed in the “Datasets” folder of the repository. Data curators with write access to the repository can do this by making a commit; community users who would like to submit data will need to open a pull request.

To check for authenticity, the data stewards will perform an MD5 checksum validation on each submission; the computed value must match the checksum specified in the submission’s manifest. At the stewards’ discretion, the submission may also be run through anti-virus software. The data stewards will also review each submission.
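The checksum validation described above can be sketched in a few lines of Python with the standard library. The function names and the assumption that the manifest stores a lowercase hex digest are illustrative, not part of the repository's tooling:

```python
# Sketch of MD5 checksum validation for a submission. The manifest is
# assumed to record the expected digest as a hex string (illustrative).
import hashlib

def md5_checksum(path: str) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_submission(data_path: str, expected_md5: str) -> bool:
    """Return True when the computed checksum matches the manifest value."""
    return md5_checksum(data_path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for large dataset files.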

Confidential, sensitive, or private information

Because the relevant data concern legislative proposals and the progress and contents of state-level legislation, very little sensitive personal information is expected to be contained in the data. Datasets will comprise publicly available information, and personal information such as the names and affiliations of state-level political figures is not considered sensitive or private. Data curators will not select data containing name, address, or political-affiliation information about non-public figures, and any data submitted by users will be reviewed for sensitive information. If a dataset contains sensitive or private information, it will not be ingested into the repository.

Identifiers for each data collection

Because our data is currently hosted on GitHub, the GitHub URL of an approved submission will serve as its persistent identifier. For identification within the repository, each submission will be assigned a unique identifier in its metadata by the data stewards once it is approved.

Additional metadata

The data stewards will keep a record of all submissions. This record includes when the submission took place, the file and folder names used, the author (GitHub username and any contact information provided), the bills covered (year and state or federal), and the status of the submission.
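As a sketch, such a record could be kept as JSON; every field name and value below is illustrative rather than a fixed schema:

```json
{
  "submission_date": "2024-03-15",
  "folder_name": "mo_2024_library_bills",
  "author": {
    "github_username": "example-user",
    "contact": "user@example.com"
  },
  "coverage": {"year": 2024, "jurisdiction": "MO"},
  "status": "under review"
}
```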

Additional Documentation

Each dataset containing information about the bills will have an accompanying bill history dataset, with each action taken on a given bill linked to that bill via its unique identifier (see example data).

Curation steps for data proper

Most of the data fields needed to submit a collection to this repository (check the Data Dictionary for required variables) can be found on the websites of individual state assemblies. The information on these websites is not typically stored in standard data formats (CSV, JSON, etc.). LegiScan, however, provides both CSV and JSON files for past and current bills across all state and federal legislation. The curators recommend that depositors use the state websites to locate the bill identifiers for legislation regarding libraries and use LegiScan to access the raw data for those bills. You can download datasets from LegiScan for each year of a state's general assembly that contain all the bills proposed during that session. Using the bill identifiers noted earlier, you can pull the necessary data from the LegiScan datasets and format it into one bills.csv file and one history.csv file.
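A minimal sketch of that workflow in Python: filter a LegiScan session export down to the bills you identified on the state assembly website. The column name `bill_number` is an assumption for illustration; check the actual LegiScan CSV headers and the Data Dictionary for required variables.

```python
# Filter a LegiScan session CSV down to selected bill identifiers.
# Column names here are illustrative; match them to the real export.
import csv

def filter_bills(legiscan_csv: str, bill_ids, out_csv: str) -> None:
    """Copy only rows whose bill_number is in bill_ids to out_csv."""
    wanted = {b.upper() for b in bill_ids}
    with open(legiscan_csv, newline="", encoding="utf-8") as src, \
         open(out_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["bill_number"].upper() in wanted:
                writer.writerow(row)
```

The same pattern, run once against the bill export and once against the history export, yields the bills.csv and history.csv files described above.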

Accepted data formats

In all cases, data in this repository should be stored in open, non-proprietary formats. Tabular data should be stored in CSV files. Non-tabular data, including plain or annotated text of bills and any metadata associated with but not included in the tabular data, should be stored either as plain text (TXT) or as PDF files.

Tabular data in formats other than CSV (such as XLSX or TSV) will be converted to CSV. Rich text in formats other than PDF will be converted to text-based PDFs, and OCR will be performed on non-text-based PDFs using Tesseract. Tabular data stored in non-tabular formats (e.g. tables embedded in text-based PDFs) will be extracted using Excalibur. If necessary, for tabular data of a manageable size, data curators may also perform manual data entry to extract tabular data from non-tabular sources.
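The simplest of these conversions, TSV to CSV, can be done with the Python standard library alone; XLSX conversion and PDF OCR/table extraction rely on external tools (Tesseract, Excalibur) and are not shown here.

```python
# Re-write tab-separated data as comma-separated data. The csv module
# handles quoting for fields that themselves contain commas.
import csv

def tsv_to_csv(tsv_path: str, csv_path: str) -> None:
    """Convert a TSV file to a CSV file, row by row."""
    with open(tsv_path, newline="", encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter="\t"):
            writer.writerow(row)
```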

Is data in proprietary formats accepted?

The primary method of data collection for this repository is opportunistic collection by the data curators themselves, who may collect data in other formats and convert it to CSV/PDF/TXT as applicable. Community users who wish to submit data must submit it in one of the preferred formats. Submissions can be shared directly with a data curator, and a ‘submission guidelines’ note within the repository itself will include both contact methods for data curators and instructions for allowed submission formats.

How are changes to datasets tracked?

Metadata for each dataset will contain an “updated_last” and an “update_notes” field so that users can see when the dataset was last changed and what changes were made in the most recent version.
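For instance, a dataset's metadata record might carry these fields as follows; the identifier and the wording of the values are illustrative:

```json
{
  "identifier": "example-0001",
  "updated_last": "2024-03-15",
  "update_notes": "Corrected sponsor names for three bills; added 2024 session history rows."
}
```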

Some normalization steps:

Because this data is highly variable and text-based, some normalization will be necessary, but too much risks reducing the value of the data.

  • All observations should be structured as rows and all variables as columns. For example, each bill should be represented by a single row, and each bill's variables, such as date created and title, should be represented as columns. If the data is not structured this way, a wide or long pivot may be necessary. To automate this, you can use the pivot_longer() or pivot_wider() functions from the tidyr package (part of the tidyverse) in R.

  • Dates should be in ISO 8601 format (e.g. YYYY-MM-DD). In OpenRefine, this can be done by highlighting the cells that need to change, then selecting “Edit cells”, then “Common transformations”, and lastly “To Date.”

  • Missing or NULL values should be empty in the table. Do not put the number 0 or N/A for these values.

  • Each bill in a dataset should have a unique identifier.

  • Multi-value columns should be split into individual columns. An alternative to this would be to create another table that contains a row for each value in the multi-value column. To do this, the unique identifier for its corresponding bill must be in the new table (see history.csv file in example data).

  • Remove special characters.

  • All values in a column must be of the same data type. For instance, in a “bill_title” column, all values should be text, with no boolean or numeric values.

  • All column names must have the same case/capitalization.
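Several of the steps above can be sketched in Python for a single record. The input conventions assumed here (US-style MM/DD/YYYY dates, “N/A” for missing values) and the column names are illustrative:

```python
# Apply a few of the normalization steps above to one row of bill data:
# uniform column-name case, empty missing values, ISO 8601 dates, and
# removal of special characters from text fields.
import datetime
import re

def normalize_row(row: dict) -> dict:
    """Normalize one record of string values (assumed input conventions)."""
    out = {}
    for name, value in row.items():
        key = name.strip().lower()                       # consistent column case
        value = value.strip()
        if value.upper() in ("N/A", "NA", "NULL"):       # missing -> empty cell
            value = ""
        elif re.fullmatch(r"\d{2}/\d{2}/\d{4}", value):  # MM/DD/YYYY -> ISO 8601
            value = datetime.datetime.strptime(value, "%m/%d/%Y").date().isoformat()
        else:
            value = re.sub(r"[^\w\s.,;:()'/-]", "", value)  # drop special chars
            value = re.sub(r"\s+", " ", value)              # collapse whitespace
        out[key] = value
    return out
```

A run over every row of a bills table, with the same function, keeps the whole file consistent before submission.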

Curator review of data collections

The repository curators will check each submitted dataset to ensure that the data meet the principles of Tidy Data and will apply the normalization steps outlined above to all data ingested into the repository. Additionally, curators will review each collection to ensure that the bills represented in the data center on legislation about libraries and library workers. Many state general assemblies host websites where bills can be searched by subject index, which can help when coding your data. For example, the Missouri House of Representatives has a search portal where users can search by subject indices; there, it would be useful to search for bills with the subject index "Libraries and Archives." You can also use LegiScan to search for bills by relevant keywords.
