In this section we list all the standards from our style guide and data guidelines that we use at Data Basis. They help us maintain high quality in the data and metadata we publish.
You can use the left menu to navigate through the different topics on this page.
## dataset_id

We name datasets in the format `<organization_id>_<description>`, where `organization_id` follows by default the geographic scope of the organization that publishes the dataset:
| Scope | organization_id |
|---|---|
| Global | `world_<organization>` |
| Federal | `<country_code>_<organization>` |
| State | `<country_code>_<state_code>_<organization>` |
| Municipal | `<country_code>_<state_code>_<city>_<organization>` |
- `country_code` and `state_code` are always 2 lowercase letters.
- `organization` is the name or acronym (preferably) of the organization that published the original data (e.g., `ibge`, `tse`, `inep`).
- `description` is a brief description of the dataset.

For example, the GDP dataset from IBGE has the `dataset_id`: `br_ibge_pib`.

Not sure how to name the organization? We suggest visiting their website and seeing how they refer to themselves (e.g., DETRAN-RJ would be `br_rj_detran`).
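To make the convention concrete, here is a minimal sketch of how a `dataset_id` could be assembled programmatically. The helper and its parameters are hypothetical, for illustration only, and are not part of our tooling:

```python
# Hypothetical helper: assembles a dataset_id following the convention above.
def build_dataset_id(description, organization, country_code=None,
                     state_code=None, city=None):
    if country_code is None:
        parts = ["world"]          # Global scope
    else:
        parts = [country_code]     # Federal scope
        if state_code:
            parts.append(state_code)  # State scope
        if city:
            parts.append(city)        # Municipal scope
    parts += [organization, description]
    return "_".join(parts).lower()

assert build_dataset_id("pib", "ibge", country_code="br") == "br_ibge_pib"
assert build_dataset_id("elections", "tse", country_code="br") == "br_tse_elections"
```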
Naming tables is less structured and therefore requires good judgment. But we have some rules:
- If there are tables for different entities, include the entity at the beginning of the name. Example: `municipality_value`, `state_value`.
- Do not include the temporal unit in the name. Example: `municipality`, not `municipality_year`.
- Keep names in the singular. Example: `school`, not `schools`.
- Name the most disaggregated tables `microdata`. Generally these have data at the person or transaction level.

Examples of `dataset_id.table_id`:
| Scope | dataset_id.table_id | Description |
|---|---|---|
| Global | `world_waze.alerts` | Waze alert data from different cities. |
| Federal | `br_tse_elections.candidates` | TSE political candidate data. |
| Federal | `br_ibge_pnad.microdata` | Microdata from the National Household Sample Survey produced by IBGE. |
| Federal | `br_ibge_pnadc.microdata` | Microdata from the Continuous National Household Sample Survey (PNAD-C) produced by IBGE. |
| State | `br_sp_see_teachers.workload` | Anonymized workload of active teachers in the SP state education network. |
| Municipal | `br_rj_riodejaneiro_cmrj_legislative.votes` | Voting data from the Rio de Janeiro City Council (RJ). |
Tables should, whenever possible, be in `long` format rather than `wide` format: one row per observation (e.g., one row per municipality-year), instead of one column per period.
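As an illustration, here is a small sketch (with made-up numbers) of reshaping a `wide` table into the preferred `long` format with pandas:

```python
import pandas as pd

# Hypothetical wide table: one GDP column per year
wide = pd.DataFrame({
    "municipality_id": ["3550308", "3304557"],
    "gdp_2019": [699.28, 331.28],
    "gdp_2020": [714.68, 300.51],
})

# Long format: one row per municipality-year
long = wide.melt(
    id_vars="municipality_id",
    var_name="year",
    value_name="gdp",
)
long["year"] = long["year"].str.replace("gdp_", "", regex=False).astype(int)
```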
Variable names must follow some rules:
- Use names already present in the repository as much as possible. Examples: `year`, `month`, `municipality_id`, `state_code`, `age`, `position`, `result`, `votes`, `revenue`, `expense`, `price`, etc.
- Use all lowercase letters (including acronyms), without accents, connected by `_`.
- Do not include connectors such as `of`, `the`, `and`, `in`, etc.
- Only use the `_id` suffix when the variable represents primary keys of entities (that would eventually have a directory table).
    - Examples that do: `municipality_id`, `state_id`, `school_id`, `person_id`.
    - Examples that do not: `network`, `location`.
- Add the entity as a prefix when the column refers to an entity different from the table's. Example: in a table whose observation level is `person`, a column about municipal GDP would be called `municipality_gdp`.
- Columns about the table's own entity need no such prefix. Example: in a table of `person`, person characteristics would simply be called `name`, `age`, `sex`, etc.
- List of allowed prefixes: `name_`, `date_`, `number_`, `quantity_`, `proportion_` (percentage variables, 0-100%), `rate_`, `ratio_`, `index_`, `indicator_` (boolean-type variables), `type_`, `code_`, `sequential_`.
- List of allowed suffixes: `_pc` (per capita).
The rule is to keep variables in their original measurement units, as listed in our standard units code, with the exception of financial variables, where we convert old currencies to the current one (e.g., Cruzeiro to Real).
We catalog measurement units in a standard format in the architecture table (complete list here). Examples: `m`, `km/h`, `BRL`.

For deflated financial columns, we list the currency with the base year. Example: a column measured in 2010 reais has the unit `BRL_2010`.

Variables must always have measurement units with base 1. In other words, use `BRL` instead of `1000 BRL`, or `person` instead of `1000 persons`. This information, like all other column metadata, is recorded in the table's architecture table.
We partially normalize our tables and have rules for which variables to include in production. They are:
- Remove entity name variables that are already in directories. Example: remove `municipality` from a table that already includes `municipality_id`.
- Remove variables used for partitioning. Example: remove `year` and `state_code` if the table is partitioned in these two dimensions.
- Add the main primary keys for each existing entity. Example: add `municipality_id` to tables that only include `municipality_tse`.
- Keep all primary keys that already come with the table, but (1) add relevant keys (e.g., `state_code`, `municipality_id`) and (2) remove irrelevant keys (e.g., `region`).
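A small sketch of what these rules can look like in practice; the file and column names here are hypothetical:

```python
import pandas as pd

# Illustrative only: `raw_table.csv` and its columns are hypothetical.
raw = pd.read_csv("raw_table.csv")

production = raw.drop(
    columns=[
        "municipality",  # entity name already covered by the municipality_id directory
        "region",        # irrelevant key
    ]
)
# `year` and `state_code` are kept here and dropped only at write time,
# when the table is saved into year=/state_code= partition folders.
```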
Fill in the `temporal_coverage` field in table, column, and key metadata (in dictionaries) according to the following pattern.
General format: `initial_date(temporal_unit)final_date`, where `initial_date` and `final_date` are written in the corresponding temporal unit. Examples:

- A table with temporal unit `year` has coverage `2005(1)2018`.
- A table with temporal unit `month` has coverage `2005-08(1)2018-12`.
- A table with temporal unit `week` has coverage `2005-08-01(7)2018-08-31`.
- A table with temporal unit `day` has coverage `2005-08-01(1)2018-12-31`.
Rules for filling in the coverage of columns and keys:

- Fill in the general pattern, except when `initial_date` or `final_date` are equal to the table's. In that case, leave it empty.
- Example: suppose a table has temporal coverage `2005(1)2018`.
    - If a column only appears in 2012 and exists until 2018, we fill in its coverage as `2012(1)`.
    - If a column exists since 2005 but was discontinued in 2013, we fill in its coverage as `(1)2013`.
    - If a column has the same coverage as the table, we fill in `(1)`.
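For illustration, a rough sketch of a pattern check for this format; this is not our official validator:

```python
import re

# Both dates are optional, so column coverages like "2012(1)", "(1)2013",
# and "(1)" also match.
COVERAGE_PATTERN = re.compile(
    r"^(\d{4}(-\d{2}){0,2})?"   # optional initial_date (year[-month[-day]])
    r"\(\d+\)"                  # temporal unit in parentheses
    r"(\d{4}(-\d{2}){0,2})?$"   # optional final_date
)

for example in ["2005(1)2018", "2005-08(1)2018-12", "2012(1)", "(1)2013", "(1)"]:
    assert COVERAGE_PATTERN.match(example)
```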
For value formats, we use:

- Decimals: `.` (dot) instead of `,` (comma).
- Date: `YYYY-MM-DD`.
- Time (24h): `HH:MM:SS`.
- Datetime (ISO-8601): `YYYY-MM-DDTHH:MM:SS.sssZ`.
- Null values: `""` (CSV), `NULL` (Python), `NA` (R), `.` or `""` (Stata).

In a nutshell, partitioning a table means dividing it into multiple blocks/parts. The central objective is to reduce financial costs and increase performance: the larger the volume of data, the greater the storage and query costs.
The cost reduction and performance gain occur mainly because partitioning reorganizes the dataset into small grouped blocks. In practice, partitioning prevents a query from traversing the entire table just to return a small slice of the data.
An example from our beloved RAIS: querying the dataset with a plain `SELECT *`.

In this case, BigQuery scanned all (`*`) columns and rows of the dataset. Note that this cost is still not very large, since the table has already been partitioned. If this dataset had not gone through the partitioning process, this query would have cost far more money and time, as it involves a considerable volume of data.
Here, we filter by the partition columns `year` and `state_code`. As a result, BigQuery only queries and returns values from the `year` folder and the `state_code` subfolder.
The first question that arises when dealing with partitioning is: above how many rows should a table be partitioned? The GCP documentation does not define a threshold number of rows. The ideal is that tables are partitioned, with few exceptions. For example, tables with fewer than 10,000 rows that will no longer receive data ingestion have low storage and processing costs, and therefore do not need to be partitioned.
If the data is stored locally, it is necessary to:

- Create the partition folders in your `/output` folder, using the language you are working with.
- Save each partition's rows to the corresponding folder.

Example of a table partitioned by `year` and `month`, using Python:
```python
import os

import pandas as pd

# Assumes `df` holds the full table and `output` points to your /output folder.
output = '/path/to/output/'
df = pd.read_csv('/path/to/raw_table.csv')

# Create one folder per partition, in Hive style (year=YYYY/month=MM)
for year in range(2005, 2020):
    for month in range(1, 13):
        partition = output + f'table_id/year={year}/month={month}'
        if not os.path.exists(partition):
            os.makedirs(partition)

# Save each year/month slice into its partition folder
for year in range(2005, 2020):
    for month in range(1, 13):
        df_partition = df[df['year'] == year].copy()  # The .copy() is not necessary, just good practice
        df_partition = df_partition[df_partition['month'] == month]
        df_partition.drop(['year', 'month'], axis=1, inplace=True)  # must drop the columns used for partitioning
        partition = output + f'table_id/year={year}/month={month}/table.csv'
        df_partition.to_csv(partition, index=False, encoding='utf-8', na_rep='')
```
In R, the examples follow the same pattern: create the `year=/month=` partition folders and write each filtered subset to its folder (e.g., filtering with `dplyr::filter()` and writing with `readr::write_csv()`).
Example of how to create a partitioned table in SQL:

```sql
CREATE TABLE `dataset_id.table_id` (
  year INT64,
  month INT64,
  col1 STRING,
  col2 STRING
)
-- BigQuery allows a single partitioning column; integer columns such as
-- `year` require integer-range partitioning. Additional columns like
-- `month` can be handled with clustering.
PARTITION BY RANGE_BUCKET(year, GENERATE_ARRAY(2005, 2020, 1))
CLUSTER BY month
OPTIONS (description = 'Description of the table')
```
The column types that BigQuery accepts for partitioning are:

- Time-unit column: tables partitioned by a `TIMESTAMP`, `DATE`, or `DATETIME` column.
- Ingestion time: tables partitioned by the date/time stamp of when BigQuery processes the data.
- Integer range: tables partitioned by an `INT64` column.

The column types that BigQuery does not accept for partitioning are: `BOOL`, `FLOAT64`, `BYTES`, etc.
BigQuery accepts up to 4,000 partitions per table.
At Data Basis, tables are usually partitioned by `year`, `month`, `quarter`, and `state_code`.
Note that when partitioning a table, you must exclude the corresponding column from the data files, since its value is recovered from the folder structure. Example: exclude the `year` column when partitioning by `year`.
Pull requests on GitHub should include at most one dataset, but can include more than one table. In other words, they can involve one or more tables within the same dataset.
In dictionaries, when a variable is a standard `enum`, we exclude the leading zeros from its keys. Keys only keep leading zeros when the number of digits has meaning (a minimal sketch of this rule follows this list):

- Example: the key of `br_bd_diretorios_brasil.cbo_2002:cbo_2002` keeps its six digits, because the first digit `0` means the category is from the grand group = "Members of the armed forces, police, and firefighters".
- Example: for `br_inep_censo_escolar.stage:stage_education`, we exclude the leading zeros, i.e., we change `01` to `1`.

For dictionary metadata:

- Do not fill in the spatial coverage (`spatial_coverage`), i.e., leave the field empty.
- Do not fill in the temporal coverage (`temporal_coverage`), i.e., leave the field empty.
- Do not fill in the observation level (`observation_level`), i.e., leave the field empty.
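A minimal sketch of the leading-zero rule for standard `enum` keys; the helper name is hypothetical, not part of our tooling:

```python
# Hypothetical helper: standard enum keys drop leading zeros ("01" -> "1"),
# while codes whose digit count carries meaning (e.g., six-digit CBO codes)
# must be kept as-is, zeros included.
def normalize_enum_key(key: str) -> str:
    return key.lstrip("0") or "0"

assert normalize_enum_key("01") == "1"
assert normalize_enum_key("10") == "10"
```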
Directories are the fundamental building blocks of our datalake. Our rules for managing directories are:
- Directories represent entities of the repository that have primary keys (e.g., `state`, `municipality`, `school`) and date-time units (e.g., `date`, `time`, `day`, `month`, `year`).
- Each directory table contains at least one primary key with unique values. Examples: `municipality:municipality_id`, `state:state_code`.
- Variable names with the `_id` suffix are reserved for primary keys of entities.

See all the tables already available here.

For directory metadata:

- Fill in the spatial coverage (`spatial_coverage`), which is the maximum spatial unit that the table covers. Example: `sa.br`, which means the spatial aggregation level of the table is Brazil.
- Do not fill in the temporal coverage (`temporal_coverage`), i.e., leave the field empty.
- Fill in the observation level (`observation_level`), which consists of the observation level of the table, i.e., what each row represents.
- Do not fill in the temporal coverage (`temporal_coverage`) of the table's columns, i.e., leave the field empty.
This field refers to the data in the raw data source, which has not yet passed through the Data Basis methodology, i.e., our _input_. When clicked, it should redirect the user to the original data source page. Our rules for managing raw data sources are:
- Name the raw data source as the publishing organization refers to it. Examples: Educational Statistics: Open Data from Inep, Penn World Tables: Groningen Growth and Development Centre.
Open an issue on our GitHub or send us a message on Discord to talk to us :)