The ability to cross-reference your databases with data from other organizations, simply and easily. Hundreds of public data sets from the largest organizations in Brazil and around the world are already present in our datalake.
Commitment to transparency, data quality, and the development of better research, analysis, and solutions for society. Not only do we democratize access to open data, but also to quality data. We have a specialized team that reviews and ensures the quality of the data added to the datalake.
Participation in an ever-growing community: thousands of journalists, researchers, and developers already use and follow the Database.
Want to upload data to the database and help us build this repository? Great! We've organized everything you need in the manual below, in 8 steps.
To make the explanation easier, we will follow a ready-made example with data from RAIS.
You can navigate through the steps in the menu on the left. We strongly suggest that you join our Discord channel to ask questions and interact with the team and other contributors! 😉
Some knowledge is required to perform this process:
Don't have any of these skills, but want to collaborate? We have a data team that can help you. Just join our Discord and send a message to #quero-contribuir.
We maintain a list of datasets for volunteers on our Github. To start uploading a database that interests you, simply open a new data issue. If your dataset is already listed, just mark your Github user as assignee.
Your first task is to fill in the information in the issue. This information will help you better understand the data and will be very useful for processing and filling in metadata.
When you have completed this step, call someone from the data team so that the information you have mapped about the set can be entered into our website!
Download the template folder here and rename it to <dataset_id> (defined in the issue in step 1). This template folder facilitates and organizes all the steps from here on out. Its structure is as follows:
<dataset_id>/
  code/: Code needed for capturing and cleaning the data (we'll see more in step 4).
  input/: All files with original data, exactly as downloaded from the primary source (we'll see more in step 4).
  output/: Final files, already in a format ready to be uploaded to the database (we'll see more in step 4).
  tmp/: Any temporary files created by the code in /code during cleaning and processing (we'll see more in step 4).
  extra/
    architecture/: Architecture tables (we'll see more in step 3).
    auxiliary_files/: Auxiliary files for the data (we'll see more in step 5).
    dictionary.csv: Dictionary table for the entire dataset (we'll see more in step 6).

Only the code folder will be committed to your project; the other files will exist only locally or in Google Cloud.
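As a convenience, the skeleton above can be recreated locally from a terminal (the folder name my_dataset stands in for your <dataset_id>):

```shell
# Replace "my_dataset" with the <dataset_id> defined in your issue.
mkdir -p my_dataset/code my_dataset/input my_dataset/output my_dataset/tmp
mkdir -p my_dataset/extra/architecture my_dataset/extra/auxiliary_files
touch my_dataset/extra/dictionary.csv
```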
The architecture tables determine the structure of each table in your dataset. They define, for example, the name, order, and metadata of the variables, as well as compatibility when there are changes in versions (for example, if a variable changes its name from one year to the next).
Each table in the dataset must have its own architecture table (spreadsheet), which must be filled out in Google Drive to allow our data team to correct it.
The RAIS architecture tables can be found here. They are a great reference to get you started, as they contain many variables and examples of different situations you may encounter.
At the beginning and end of each step, consult our style guide to ensure you're following the BD standardization.
For each table, fill in the architecture spreadsheet:

- the original_name column
- an original_name_YYYY column for each available year or month
- the name column
- the bigquery_type column
- the description, following the style manual
- the temporal_coverage for each variable
- the covered_by_dictionary column
- the directory_column column
- for int64 or float64 variables, check whether a unit of measurement needs to be included

When you have finished filling in the architecture tables, contact the Database team to validate everything. It is necessary to be clear about the final format the data should be in before starting to write the code; this way we avoid rework.
Once the architecture tables have been validated, we can write the codes for capturing and cleaning the data.
Capture: Code that automatically downloads all the original data and saves it in /input. This data may be available on portals or FTP links, or it may be scraped from websites, among other sources.
Cleaning: Code that transforms the original data saved in /input into clean data, saves it in the /output folder, and then uploads it to the database.
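As a minimal sketch of the cleaning step, the snippet below reads a raw file from /input, renames columns following a rename map (the file name, column names, and rename map here are hypothetical stand-ins, not RAIS specifics), and writes the result to /output:

```python
import csv
import os

# Hypothetical raw file standing in for data downloaded into input/.
os.makedirs("input", exist_ok=True)
with open("input/raw.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["ANO", "UF", "VL_REMUN"])  # original names from the source
    w.writerow(["2020", "SP", "1500,50"])

# Rename map taken from the architecture table (original_name -> name).
RENAME = {"ANO": "ano", "UF": "sigla_uf", "VL_REMUN": "valor_remuneracao"}

os.makedirs("output", exist_ok=True)
with open("input/raw.csv", newline="") as src, \
        open("output/clean.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(RENAME.values()))
    writer.writeheader()
    for row in reader:
        clean = {RENAME[k]: v for k, v in row.items()}
        # Convert the Brazilian decimal comma to a decimal point.
        clean["valor_remuneracao"] = clean["valor_remuneracao"].replace(",", ".")
        writer.writerow(clean)
```

Real cleaning code will of course do much more (type casts, missing-value handling, joins with directories), but the input-to-output flow stays the same.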
Each cleaned table for production can be saved as a single file or, if it is too large (e.g., over 200 MB), partitioned in Hive format into several subfiles. The accepted formats are .csv and .parquet. We recommend partitioning tables by year, month, and state abbreviation (sigla_uf). Partitioning is done through the folder structure; see the example below.
The microdados_vinculos table from RAIS Vinculos, for example, is a very large table (+400GB), so we partitioned it by year and sigla_uf. Partitioning was done using the folder structure /microdados_vinculos/year=YYYY/sigla_uf=XX.
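A minimal sketch of how this Hive-style folder layout can be produced with the standard library (the rows here are toy stand-ins for the much larger RAIS microdata):

```python
import csv
import os

# Toy rows standing in for the microdata; illustrative only.
rows = [
    {"year": "2020", "sigla_uf": "SP", "valor": "1.0"},
    {"year": "2020", "sigla_uf": "RJ", "valor": "2.0"},
    {"year": "2021", "sigla_uf": "SP", "valor": "3.0"},
]

output_root = "output/microdados_vinculos"
for row in rows:
    # Partition values live in the folder names, Hive-style:
    # output/microdados_vinculos/year=YYYY/sigla_uf=XX/
    part_dir = os.path.join(
        output_root, f"year={row['year']}", f"sigla_uf={row['sigla_uf']}"
    )
    os.makedirs(part_dir, exist_ok=True)
    path = os.path.join(part_dir, "data.csv")
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            # Partition columns are encoded in the path,
            # so only the data columns go inside the file.
            writer.writerow(["valor"])
        writer.writerow([row["valor"]])
```

For large data you would stream the source file partition by partition rather than hold rows in memory, but the folder naming convention is the part that matters for upload.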
Codes can be scripts (.py, .R, ...) or notebooks (Google Colab, Jupyter, Rmarkdown, etc.). All file paths must be relative to the root of the template folder (<dataset_id>), i.e., they must not depend on the paths on your computer.

The cleaning code was built in R and can be found here.
The cleaning code was built in Python and can be found here.
It is common for databases to be made available with auxiliary files. These may include technical notes, collection and sampling descriptions, etc. To help database users gain more context and better understand the data, organize all these auxiliary files in /extra/auxiliary_files.
Feel free to structure subfolders as you wish within this directory. The important thing is that it is clear what these files are.
Often, especially with older databases, there are multiple dictionaries in Excel or other formats. In the Database, we unify everything into a single file in .csv format—a single dictionary for all columns in all tables in your set.
Important details on how to build your dictionary are in our style manual.
The complete dictionary can be consulted here. It already has the standard structure we use for dictionaries.
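As a sketch, the unified dictionary can be assembled programmatically. The column names below are my reading of the standard dictionary structure; treat them as assumptions and confirm against the style manual. The example values are illustrative, not real RAIS codes:

```python
import csv

# Each entry maps a coded value in some column of some table to its label.
# Column names are an assumption based on the BD style manual; verify them there.
entries = [
    {"id_tabela": "microdados_vinculos", "nome_coluna": "tipo_vinculo",
     "chave": "10", "cobertura_temporal": "(1)", "valor": "Example label A"},
    {"id_tabela": "microdados_vinculos", "nome_coluna": "tipo_vinculo",
     "chave": "15", "cobertura_temporal": "(1)", "valor": "Example label B"},
]

# One .csv for all columns of all tables in the dataset.
with open("dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(entries[0].keys()))
    writer.writeheader()
    writer.writerows(entries)
```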
All done! Now all that's left is to upload it to Google Cloud and send it for review.
To do this, we will use the basedosdados client (available in Python), which facilitates the settings and steps of the process.
Since there is a cost for storage, to complete this step we will need to provide you with a specific api_key for volunteers to upload the data to our development environment. Join our Discord channel, message us in #quero-contribuir, and tag @equipe_dados.
7.1 Install our client on your terminal: pip install basedosdados.
7.2 Run import basedosdados as bd in Python and follow the step-by-step instructions to configure locally with your Google Cloud project credentials. Fill in the information as follows:
* STEP 1: y
* STEP 2: basedosdados-dev (place the .json file provided by the bd team in the credentials folder)
* STEP 3: y
* STEP 4: basedosdados-dev
* STEP 5: https://api.basedosdados.org/api/v1/graphql
The data will pass through three places in Google Cloud:
7.3 Create the table in the GCS bucket and BigQuery-DEV-staging, using the Python API, as follows:
import basedosdados as bd

DATASET_ID = "dataset_id"  # Name of the dataset
TABLE_ID = "table_id"      # Name of the table

tb = bd.Table(dataset_id=DATASET_ID, table_id=TABLE_ID)

tb.create(
    path=path_to_data,  # Path to the csv or parquet file
    if_storage_data_exists="raise",
    if_table_exists="replace",
    source_format="csv",
)
If your data is partitioned, the path must point to the folder where the partitions are. Otherwise, it must point to a .csv file (for example, microdados.csv). If the project does not exist in BigQuery, it will be created automatically.
7.4 Create the .sql and schema.yml files from the architecture table to run materialization and tests in dbt (data build tool):
from databasers_utils import TableArchitecture
arch = TableArchitecture(
dataset_id="<dataset-id>",
tables={
"<table-id>": "URL of the Google Sheet architecture", # Example https://docs.google.com/spreadsheets/d/1K1svie4Gyqe6NnRjBgJbapU5sTsLqXWTQUmTRVIRwQc/edit?usp=drive_link
},
)
# Creates the yaml file
arch.create_yaml_file()
# Creates the sql files
arch.create_sql_files()
# Updates the dbt_project.yml
arch.update_dbt_project()
If necessary, at this point you can change the SQL query to perform final processing from the staging table. You can add columns, remove columns, perform algebraic operations, replace strings, etc. SQL is the limit!
7.5 Using DBT
The dbt .sql files use the set_datalake_project macro, which indicates which project (basedosdados-staging or basedosdados-dev) the data will be consumed from. When you create the files with the create_sql_files function, the macro is inserted automatically.
select
    col_name
from {{ set_datalake_project("<DATASET_ID>_staging.<TABLE_ID>") }}
Materialize a single model by name, in basedosdados-dev, consuming data from the dataset's staging tables:

dbt run --select dataset_id__table_id

Materialize a single model by its fully qualified node name:

dbt run --select model.dataset_id.dataset_id__table_id

Materialize all models under a path:

dbt run --select models/dataset_id

Materialize a single model by the path of its .sql file:

dbt run --select models/dataset_id/table_id.sql

Test a single model:

dbt test --select dataset_id__table_id

Test a single model by its fully qualified node name:

dbt test --select model.dataset_id.dataset_id__table_id

Test all models under a path:

dbt test --select models/dataset_id
7.6 Upload table metadata to the website:
For now, only the data team has permission to upload table metadata to the website, so you will need to contact us. We are already working to enable volunteers to update data on the website in the near future.
Phew, that's it! Now all that's left is to send everything for review to the repository in the Database.
Use cd to navigate to the local repository folder and open a new branch with git checkout -b [dataset_id]. All additions and modifications will be included in this branch.

Include the dataset_id__table_id.sql files in the pipelines/models/dataset_id/ folder, copying the queries and schema you created in step 7. Include the capture and cleaning code in the pipelines/models/dataset_id/code folder.

What now? Our team will review the data and metadata submitted via GitHub. We may contact you to ask questions or request changes to the code. When everything is OK, we will merge your pull request and the data will be automatically published on our platform!