The ability to cross-reference your databases with data from different organizations, simply and easily. Our datalake already holds hundreds of public datasets from the largest organizations in Brazil and worldwide.
A commitment to transparency, data quality, and the development of better research, analysis, and solutions for society. We democratize access not only to open data, but to quality data: a specialized team reviews and ensures the quality of the data added to the datalake.
Participation in an ever-growing community: thousands of journalists, researchers, and developers already use and follow Data Basis.
Want to upload data to DB and help us build this repository? Wonderful! We've organized everything you need in the manual below, in 8 steps.
To make the explanation easier to follow, we'll walk through a ready-made example with data from RAIS.
You can navigate through the steps using the menu on the left. We strongly suggest joining our Discord channel to ask questions and interact with the team and other contributors! 😉
Some knowledge is necessary to carry out this process:
Don't have some of these skills but want to contribute? We have a data team that can help you: just join our Discord and send a message in #want-to-contribute.
We keep the list of datasets for volunteers on our Github. To start uploading a dataset you're interested in, just open a new data issue. If your dataset is already listed, just add your Github user as assignee.
Your first task is to fill in the information in the issue. This information will help you understand the data better and will be very useful for the cleaning process and for filling in the metadata.
When you finish this step, call someone from the data team so that the information you mapped about the dataset can already be added to our site!
Download the template folder here and rename it to <dataset_id> (defined in the issue of step 1). This template folder simplifies and organizes all the steps from here on. Its structure is as follows:
<dataset_id>/
    code/: Necessary code for capturing and cleaning the data (we'll see more in step 4).
    input/: Contains all the original files with data, exactly as downloaded from the primary source (we'll see more in step 4).
    output/: Final files, already in the ready-to-upload format (we'll see more in step 4).
    tmp/: Any temporary files created by the code in /code during the cleaning and treatment process (we'll see more in step 4).
    extra/
        architecture/: Architecture tables (we'll see more in step 3).
        auxiliary_files/: Auxiliary files to the data (we'll see more in step 5).
        dicionario.csv: Dictionary table for the entire dataset (we'll see more in step 6).
Only the code folder will be committed to your project; the other files will only exist locally or in Google Cloud.
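If you prefer to create this structure programmatically instead of downloading the template, a minimal sketch in Python could look like the one below (the dataset_id value is just an example; use the one defined in the issue of step 1):

import pathlib

# Example name; replace with the <dataset_id> defined in the issue of step 1
dataset_id = "br_me_rais"

# Folders from the template structure above
for folder in ["code", "input", "output", "tmp", "extra/architecture", "extra/auxiliary_files"]:
    pathlib.Path(dataset_id, folder).mkdir(parents=True, exist_ok=True)

# Empty dictionary file, to be filled in step 6
pathlib.Path(dataset_id, "extra", "dicionario.csv").touch()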
The architecture tables define the structure of each table in your dataset: for example, the name, order, and metadata of the variables, as well as compatibilities when versions change (for example, when a variable changes name from one year to the next).
Each table in the dataset must have its own architecture table (spreadsheet), which must be filled in on Google Drive so our data team can review and correct it.
The RAIS architecture tables can be consulted here. They are a great reference for starting your work, since they have many variables and examples of the various situations you might encounter.
At the beginning and end of each step, consult our style guide to ensure you're following the BD standardization.
Fill in, for each table:
* the original_name column
* an original_name_YYYY column for each year or month available
* the name column
* the bigquery_type column
* the description column, according to the manual
* the temporal_coverage for each variable
* the covered_by_dictionary column
* the directory_column column
* for variables of type int64 or float64, check whether it's necessary to include a measurement unit
When you finish filling in the architecture tables, contact the Data Basis team to validate everything. It's necessary to be clear about the final format of the data before starting to write the code; this way we avoid redoing work.
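Before reaching out, it can help to visualize what a filled-in architecture table looks like. The sketch below is purely illustrative: the rows, the file name, and the measurement_unit column name are assumptions, not taken from the real RAIS architecture tables; the authoritative structure is the one in the style guide and in the RAIS spreadsheets linked above.

import pandas as pd

# Hypothetical rows, only to illustrate the columns from the checklist above
architecture = pd.DataFrame(
    [
        {
            "original_name": "ANO",
            "name": "year",
            "bigquery_type": "int64",
            "description": "Reference year",
            "temporal_coverage": "1985(1)2021",
            "covered_by_dictionary": "no",
            "directory_column": "",
            "measurement_unit": "",
        },
        {
            "original_name": "VL_REMUN_MEDIA_NOM",
            "name": "average_nominal_wage",
            "bigquery_type": "float64",
            "description": "Average nominal wage in the year",
            "temporal_coverage": "1985(1)2021",
            "covered_by_dictionary": "no",
            "directory_column": "",
            "measurement_unit": "BRL",
        },
    ]
)

# Saved locally only for reference; the real architecture table is filled in on Google Drive
architecture.to_csv("extra/architecture/<table_id>.csv", index=False)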
After validating the architecture tables, we can write the data capture and cleaning code.
* Capture: Code that automatically downloads all the original data and saves it in /input. This data may be available on portals or FTP links, may need to be scraped from websites, among other sources.
* Cleaning: Code that transforms the original data saved in /input into clean data and saves it in the /output folder, to be later uploaded to DB.
Each clean production table can be saved as a single file or, if it's very large (e.g., above 200 MB), partitioned in the Hive format into several sub-files. The accepted formats are .csv or .parquet. Our recommendation is to partition tables by year, month, and state_abbreviation. The partitioning is done through the folder structure; see the example below to visualize how.
The microdados_vinculos table from RAIS, for example, is a very large table (+250 GB), so we partitioned it by year and state_abbreviation. The partitioning was done using the folder structure /microdados_vinculos/year=YYYY/state_abbreviation=XX.
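As an illustration, here is a minimal sketch of how the partitioned files could be written with pandas. The column names, input path, and output path are assumptions for the example, not the actual RAIS cleaning code:

import os
import pandas as pd

# Cleaned data produced earlier by the cleaning code (illustrative path)
df = pd.read_csv("tmp/microdados_vinculos_clean.csv")

# Write one file per partition, following the Hive folder convention shown above
for (year, uf), chunk in df.groupby(["year", "state_abbreviation"]):
    partition_dir = f"output/microdados_vinculos/year={year}/state_abbreviation={uf}"
    os.makedirs(partition_dir, exist_ok=True)
    # The partition columns are dropped from the file itself, since their values live in the path
    chunk.drop(columns=["year", "state_abbreviation"]).to_csv(
        f"{partition_dir}/microdados.csv", index=False
    )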
* The code can be written as scripts (.py, .R, ...) or notebooks (Google Colab, Jupyter, Rmarkdown, etc.).
* File paths must be relative to the root of the <dataset_id> folder, that is, they must not depend on the paths of your computer.
The RAIS cleaning code was built in R and can be consulted here. A cleaning code built in Python can be consulted here.
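For reference, a minimal capture sketch in Python that respects the relative-path rule above (the URL and file name are placeholders, not the real RAIS source):

import os
import requests

# Placeholder URL; the real primary source is the one mapped in the issue of step 1
SOURCE_URL = "https://example.org/data/microdados_2021.zip"

# Save the original file exactly as published, using a path relative to <dataset_id>/
os.makedirs("input", exist_ok=True)
response = requests.get(SOURCE_URL, timeout=60)
response.raise_for_status()
with open("input/microdados_2021.zip", "wb") as f:
    f.write(response.content)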
It's common for databases to be made available with auxiliary files. These can include technical notes, collection and sampling descriptions, etc. To help users of Data Basis have more context and understand the data better, organize all these auxiliary files in /extra/auxiliary_files.
Feel free to structure sub-folders as you like there. What matters is that it's clear what these files are.
Sometimes, especially with old bases, there are multiple dictionaries in Excel or other formats. In Data Basis, we unify everything in a single file in .csv format - a single dictionary for all columns of all tables in your dataset.
Important details on how to build your dictionary are in our style guide.
The complete dictionary can be consulted here. It already has the standard structure we use for dictionaries.
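As a sketch of how this consolidation might be done (the input file locations and the assumption that the originals are Excel sheets are illustrative; the required column layout for the final dictionary is the one defined in the style guide and in the RAIS dictionary linked above):

import glob
import pandas as pd

# Illustrative: read each original Excel dictionary and stack them into one table
frames = [pd.read_excel(path) for path in glob.glob("input/dictionaries/*.xlsx")]
dictionary = pd.concat(frames, ignore_index=True)

# A single dictionary for all tables of the dataset, saved in the template location
dictionary.to_csv("extra/dicionario.csv", index=False)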
All set! Now all that's left is to upload to Google Cloud and send for review. For that, we'll use the basedosdados client (available in Python), which facilitates the process.
Since storage in Google Cloud Storage has a cost, to finalize this step we'll need to provide you with an api_key specifically for volunteers to upload the data to our development environment. So join our Discord channel and call us in #want-to-contribute.
7.1 Install our client in your terminal: pip install basedosdados.
7.2 Run import basedosdados as bd in Python and follow the step-by-step process to configure the client locally with the credentials of your Google Cloud project. Fill in the information as follows:
* STEP 1: y
* STEP 2: basedosdados-dev (put the .json passed by the bd team in the credentials folder)
* STEP 3: y
* STEP 4: basedosdados-dev
* STEP 5: https://api.basedosdados.org/api/v1/graphql
The data will pass through 3 places in Google Cloud:
7.3 Create the table in the GCS bucket and BigQuery-DEV-staging, using the Python API, as follows:
import basedosdados as bd

# Point to your dataset and table (use the ids defined in the earlier steps)
tb = bd.Table(
    dataset_id='<dataset_id>',
    table_id='<table_id>',
)

# Uploads the files in `path` to Storage and creates the staging table in BigQuery-DEV
tb.create(
    path='<path_to_the_data>',
    if_table_exists='raise',
    if_storage_data_exists='raise',
)
The following parameters can be used:
* path (required): the complete path of the file on your computer, for example: /Users/<your_username>/projects/basedosdados/sdk/bases/[DATASET_ID]/output/microdados.csv. If your data is partitioned, the path must point to the folder where the partitions are. Otherwise, it must point to a .csv file (for example, microdados.csv).
* force_dataset: command that creates the dataset configuration files in BigQuery.
* if_table_exists: command used if the table already exists in BQ.
* if_storage_data_exists: command used if the data already exists in Google Cloud Storage.
If the project doesn't exist in BigQuery, it will be automatically created.
Consult our API for more details on each method.
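For example, for the partitioned RAIS table shown earlier, the call could look like the sketch below (the dataset_id value and the local path are assumptions for the example):

import basedosdados as bd

tb = bd.Table(
    dataset_id='br_me_rais',           # assumed dataset_id for the example
    table_id='microdados_vinculos',
)

# `path` points to the folder that contains the year=YYYY/state_abbreviation=XX partitions
tb.create(
    path='output/microdados_vinculos',
    if_table_exists='raise',
    if_storage_data_exists='raise',
)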
7.4 Create the .sql and schema.yml files from the architecture table, following this documentation.
If you need to, at this moment you can change the SQL query to perform final treatments on the staging table: you can include columns, remove columns, perform algebraic operations, substitute strings, etc. SQL is the limit!
7.5 Run and test the models locally, following this documentation.
7.6 Upload the table metadata to the site:
For now, only the data team has permissions to upload the table metadata to the site, so it will be necessary to contact us. We're already working to make it possible for volunteers to update data on the site in the near future.
7.7 Upload the auxiliary files:
st = bd.Storage(
    dataset_id='<dataset_id>',
    table_id='<table_id>',
)

# Uploads everything in `path` to the auxiliary_files area of the table in Storage
st.upload(
    path='<path_to_the_auxiliary_files>',
    mode='auxiliary_files',
    if_exists='raise',
)
Yay, that's it! Now all that's left is to send everything for review in the Data Basis repository.
* cd into the local folder of the repository and open a new branch with git checkout -b [dataset_id]. All additions and modifications will be included in this branch.
* Add the table_id.sql file to the queries-basedosdados/models/dataset_id/ folder by copying the queries you developed in step 7.
* Add the dataset to the queries-basedosdados/dbt_project.yaml file (don't forget to follow the alphabetical order so you don't mess up the organization).
* Add the capture and cleaning code to the queries-basedosdados/models/dataset_id/code folder.
And now? Our team will review the data and metadata submitted via Github. We may contact you to ask questions or request changes to the code. When everything is OK, we'll merge your pull request and the data will be automatically published on our platform!