Home

Current work in the mr-recodes branch

This page walks through the contents of examples/recodes.py and explains the operations it demonstrates.

These are the imports and helper functions the script needs to get started. The examples module exists only to provide fixtures for these example scripts.

from examples import NEWS_DATASET, NEWS_DATASET_ROWS, mr_in

from getpass import getpass
from scrunch import connect
from scrunch.datasets import create_dataset

HOST = 'https://alpha.crunch.io'

username = raw_input("Enter email: ")
password = getpass("Enter password for %s: " % username)

site = connect(username, password, site_url='%s/api/' % HOST)

To create a dataset, use the create_dataset function. It accepts the same input as the body of a request to the Crunch datasets catalog endpoint.

# Create a dataset for usage
dataset = create_dataset("Recodes example", NEWS_DATASET)
print("Dataset %s created" % dataset.id)

Scrunch supports adding rows; this was added only so that this example has data to work with. Gryphon currently uses Pycrunch directly to stream rows into Crunch.

# Add data rows
total = dataset.stream_rows(NEWS_DATASET_ROWS)
dataset.push_rows(total)

The following are examples of recode, which creates variables using Crunch's case function. It decides whether to create an array or a single response variable based on its last argument, multiple. Scrunch then constructs a different Crunch expression for each case: either a simple case variable, or an array whose subvariables are case variables, one per response.

Note the mr_in helper: it was added to work around a bug in the has_any() syntax. There is now a fix branch, so this helper should no longer be needed.

# Recode a new single response variable
agerange = dataset.recode([
    {'id': 1, 'name': 'Underage', 'rules': 'age < 18'},
    {'id': 2, 'name': 'Millennials', 'rules': 'age >= 18 and age < 25'},
    {'id': 3, 'name': 'Gen X', 'rules': 'age < 35 and age >= 25'},
    {'id': 4, 'name': 'Grown ups', 'rules': 'age < 60 and age >= 35'},
    {'id': 5, 'name': '60+', 'rules': 'age >= 60'}
], alias='agerange', name='Age range', multiple=False)

print("Variable %s created" % agerange.alias)
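Because each category is defined by an independent rule, it is easy to leave a boundary value uncovered or covered twice. The sketch below is one way to check a set of age bands locally before sending them to Crunch. It is plain Python and does not call scrunch: the predicates are hand-written stand-ins for rule strings of this shape, and matching_ids is a hypothetical helper name.

```python
# Local sanity check for age-band rules; plain Python, no Crunch connection.
# Each entry is (category id, predicate). The predicates are written by hand
# for this sketch; scrunch parses the rule strings itself.
AGE_RULES = [
    (1, lambda age: age < 18),
    (2, lambda age: 18 <= age < 25),
    (3, lambda age: 25 <= age < 35),
    (4, lambda age: 35 <= age < 60),
    (5, lambda age: age >= 60),
]


def matching_ids(age, rules=AGE_RULES):
    """Return the ids of every category whose rule accepts this age."""
    return [cat_id for cat_id, rule in rules if rule(age)]


# Collect every age that matches zero categories or more than one.
problems = dict((age, matching_ids(age)) for age in range(0, 121)
                if len(matching_ids(age)) != 1)
print("Ages with zero or multiple matches: %s" % problems)
```

An empty dictionary means every tested age lands in exactly one band. A band written with a strict comparison on both sides of a boundary would show up here as an age with no matches.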

# Recode a new multiple response variable from an existing multiple response variable
origintype = dataset.recode([
    {'id': 1, 'name': "Online",
     # Mixed support for using "category"(subvariables really) IDs
     'rules': mr_in(dataset, 'newssource', [1, 2, 3, 4])},  # Only in the helper
    {'id': 2, 'name': "Print", 'rules': mr_in(dataset, 'newssource', [5, 6])},
    {'id': 3, 'name': "Tv", 'rules': mr_in(dataset, 'newssource', [7, 9])},
    {'id': 4, 'name': "Radio", 'rules': mr_in(dataset, 'newssource', [8, 10])},
], alias='origintype', name="News source by type", multiple=True)

print("Variable %s created" % origintype.alias)
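To make the multiple response recode concrete, here is a plain-Python illustration of the matching it is meant to express. It assumes that mr_in behaves like "has any": a new response counts as selected when the respondent selected at least one of the listed newssource subvariables. It does not call scrunch, and recode_responses is a hypothetical name used only for this sketch.

```python
# Plain-Python illustration of "has any" matching for one respondent.
# Keys are the new response ids, values are the newssource subvariable ids
# that feed each response (taken from the recode above).
RESPONSE_SOURCES = {
    1: [1, 2, 3, 4],   # Online
    2: [5, 6],         # Print
    3: [7, 9],         # Tv
    4: [8, 10],        # Radio
}


def recode_responses(selected, mapping=RESPONSE_SOURCES):
    """Return the sorted ids of the new responses selected for one row.

    `selected` holds the newssource subvariable ids the respondent picked.
    """
    picked = set(selected)
    return sorted(new_id for new_id, sources in mapping.items()
                  if picked.intersection(sources))


print(recode_responses([2, 7]))   # picked one online and one TV source
print(recode_responses([]))       # picked nothing
```

A respondent can end up in several new responses at once, which is why the result has to be an array of subvariables rather than a single categorical.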

Exclusion filters are already supported; this works in master.

# Add an exclusion filter
dataset.exclude('agerange == 1')  # Remove underage

Scrunch's variable copying has two internal implementations:

If the variable is derived, it re-executes the derivation, creating a new variable with the same expression.
If the variable is not derived, it builds a copy_variable expression for it.

# Copy a variable
origintype_copy = dataset.copy_variable(origintype, name='Copy of origintype',
    alias='origintype_copy')
print("Variable %s created" % origintype_copy.alias)
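The choice between the two copy strategies can be sketched as a small dispatch function. This is an illustration only, not scrunch's code: the variable is modeled as a hypothetical dict with a 'url' key and an optional 'derivation' key, and the shape of the copy_variable expression is an assumption.

```python
import copy


def build_copy_expression(variable):
    """Return the expression a copy of `variable` would be created from.

    `variable` is a hypothetical dict: 'url' identifies the source variable
    and 'derivation' (if present) holds the expression that produced it.
    """
    derivation = variable.get('derivation')
    if derivation is not None:
        # Derived: reuse the same expression so it is executed again.
        return copy.deepcopy(derivation)
    # Not derived: wrap the source variable in a copy_variable call.
    # The expression layout here is assumed for illustration.
    return {'function': 'copy_variable',
            'args': [{'variable': variable['url']}]}


plain = {'url': 'variables/abc/'}
derived = {'url': 'variables/def/',
           'derivation': {'function': 'case', 'args': []}}

print(build_copy_expression(plain)['function'])
print(build_copy_expression(derived)['function'])
```

Copying the derivation rather than the data means the copy stays a derived variable, which is what allows a copied combination to be edited later.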

The combine function calls either combine_responses on multiple response variables or combine_categories on single categorical variables. It checks the input variable's type and generates the matching Crunch expression.

To edit a variable's combination, as shown in the snippet's comments, use the edit_combination function. It replaces the variable's expression with a new one, updating its definition. Note that this only works on variables produced by combine, since internally it supports only combine_responses and combine_categories.

# Combine responses from origintype_copy, 4 is in the wrong place
onlinenewssource = dataset.combine(origintype_copy, [
    {"id": 1, "name": 'online', 'combined_ids': [1, 4]},
    {"id": 2, "name": 'notonline', 'combined_ids': [2, 3]}
], name='Online or not', alias='onlinenewssource')
print('Created combination: %s' % onlinenewssource.alias)

onlinenewssource.edit_combination([
    {"id": 1, "name": 'online', 'combined_ids': [1]},
    {"id": 2, "name": 'notonline', 'combined_ids': [2, 3, 4]}
])
print('Fixed combination: %s' % onlinenewssource.alias)

# Combine a single categorical - category 3 is in the wrong group
over35 = dataset.combine(agerange, [
    {"id": 1, "name": 'under35', 'combined_ids': [1, 2], 'missing': False},
    {"id": 2, "name": 'over35', 'combined_ids': [3, 4, 5], 'missing': False}
], name='over 35?', alias='over35')
print('Created combination: %s' % over35.alias)

# Edit the combination, placing category 3 in the right group
over35.edit_combination([
    {"id": 1, "name": 'Under 35', 'combined_ids': [1, 2, 3], 'missing': False},
    {"id": 2, "name": 'Over 35', 'combined_ids': [4, 5], 'missing': False}
])
print('Fixed combination: %s' % over35.alias)

# Make a copy of a combined categorical variable
all_ages = dataset.copy_variable(over35, name='All ages', alias='allages')
# Edit its categories so they are all combined into a single group.
all_ages.edit_combination([
    {"id": 1, "name": 'all', 'combined_ids': [1, 2, 3, 4, 5]},
])
print("Edited combined variable: %s" % all_ages.alias)
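The type-based dispatch that combine performs, together with a local check that would catch a misplaced or forgotten id before creating the variable, can be sketched as follows. This is plain Python under stated assumptions: the type labels 'categorical' and 'multiple_response' and both helper names are chosen for illustration, and nothing here calls scrunch.

```python
# Assumed type labels mapped to the Crunch function named in the text above.
COMBINE_FUNCTIONS = {
    'categorical': 'combine_categories',
    'multiple_response': 'combine_responses',
}


def combine_function_for(variable_type):
    """Pick the combine function name for a variable type (hypothetical)."""
    try:
        return COMBINE_FUNCTIONS[variable_type]
    except KeyError:
        raise TypeError("Cannot combine a %s variable" % variable_type)


def check_combinations(combinations, source_ids):
    """Return (unassigned, duplicated) source ids for a combination list."""
    seen = set()
    duplicated = set()
    for group in combinations:
        for source_id in group['combined_ids']:
            if source_id in seen:
                duplicated.add(source_id)
            seen.add(source_id)
    unassigned = set(source_ids) - seen
    return sorted(unassigned), sorted(duplicated)


print(combine_function_for('categorical'))
print(check_combinations([
    {"id": 1, "name": 'Under 35', 'combined_ids': [1, 2, 3]},
    {"id": 2, "name": 'Over 35', 'combined_ids': [4, 5]},
], source_ids=[1, 2, 3, 4, 5]))
```

Two empty lists mean every source id is used exactly once. Whether leaving an id out is an error depends on the intended result, so the check only reports it.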

The download step uses Pycrunch's export_dataset. We still need to define how exports will work and design a way to avoid multiple flags on this function.

# Export some rows
dataset.download("recodes.csv")

print('Visit on the web: %s' % dataset.web_url(HOST))
