Commit 2a1a7f0

Merge pull request #93 from staadecker/docs
Add documentation on Pandas
2 parents 9aa7d44 + c7fc7c4 commit 2a1a7f0

3 files changed: +159 -0 lines changed

README.md

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,8 @@ In `docs/`:

 - [`Numerical Issues.md`](/docs/Numerical%20Issues.md): Information about detecting and resolving numerical issues.

+- [`Pandas.md`](/docs/Pandas.md): Crash course on the Pandas data manipulation library.
+
 Finally, you can generate documentation for the SWITCH modules by running `pydoc -w switch_model` after having installed
 SWITCH. This will build HTML documentation files from python doc strings which
 will include descriptions of each module, their intentions, model

docs/Pandas.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
# Using Pandas

[Pandas](https://pandas.pydata.org/) is a Python library used for data analysis and manipulation.

In SWITCH, Pandas is mainly used to create graphs and to write output files after solving.

This document gives a brief overview of the key concepts and commands
needed to get started with Pandas. Many more thorough resources are available
online, including entire courses.

Most importantly, the Pandas [documentation](https://pandas.pydata.org/docs/)
and [API reference](https://pandas.pydata.org/docs/reference/index.html#api) should be your go-to
when trying to learn something new about Pandas.

## Key Concepts

### DataFrame

The DataFrame is the main Pandas data structure and is responsible for
storing tabular data.
Dataframes have rows, columns and labelled axes (e.g. row or column names).
When manipulating data,
the common practice is to store your main dataframe in a variable called `df`.

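As a minimal sketch of the description above, the snippet below builds a small dataframe by hand and inspects its shape and axis labels. The column names and values are made up for illustration and do not come from any SWITCH file.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame(
    {
        "country": ["Canada", "Mexico", "Chile"],
        "capacity_mw": [120.0, 80.5, 45.0],
    }
)

n_rows, n_cols = df.shape        # (number of rows, number of columns)
column_names = list(df.columns)  # labels of the column axis
row_labels = list(df.index)      # labels of the row axis (default: row numbers)
```

Here `df.shape`, `df.columns` and `df.index` are the usual ways to check what a dataframe contains before working with it.
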
### Series

A series can be thought of as a single column in a dataframe.
It's a 1-dimensional array of values.

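To make this concrete, here is a small sketch that pulls one column out of a dataframe as a series and also builds a standalone series. The data and names are hypothetical.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"country": ["Canada", "Mexico"], "capacity_mw": [120.0, 80.5]})

capacity = df["capacity_mw"]  # selecting a single column gives a Series
total = capacity.sum()        # Series support aggregate operations such as sum()

# A Series can also be created on its own, without a dataframe.
standalone = pd.Series([1, 2, 3], name="example")
```
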
### Indexes

Pandas has two ways of working with dataframes: with or without custom indexes.
Custom indexes are essentially labels for each row. For example, the following
dataframe has 4 columns (A, B, C, D) and a custom index (the date).

```
                   A         B         C         D
2000-01-01  0.815944 -2.093889  0.677462 -0.982934
2000-01-02 -1.688796 -0.771125 -0.119608 -0.308316
2000-01-03 -0.527520  0.314343  0.852414 -1.348821
2000-01-04  0.133422  3.016478 -0.443788 -1.514029
2000-01-05 -1.451578  0.455796  0.559009 -0.247087
```

The same dataframe can be expressed without the custom index as follows.
Here the date is a column just like the others and the index is the
default index (just the row number).

```
         date         A         B         C         D
0  2000-01-01  0.815944 -2.093889  0.677462 -0.982934
1  2000-01-02 -1.688796 -0.771125 -0.119608 -0.308316
2  2000-01-03 -0.527520  0.314343  0.852414 -1.348821
3  2000-01-04  0.133422  3.016478 -0.443788 -1.514029
4  2000-01-05 -1.451578  0.455796  0.559009 -0.247087
```

Using custom indexes is quite powerful but more advanced. When starting
out it's best to avoid custom indexes.

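The two layouts above can be converted into one another. The following sketch, using a shortened and rounded version of the table, shows one way to do it with `set_index` and `reset_index`.

```python
import pandas as pd

# Shortened, rounded version of the table above (illustrative values only).
df = pd.DataFrame({"date": ["2000-01-01", "2000-01-02"], "A": [0.8, -1.7]})

# Turn the "date" column into a custom index.
indexed = df.set_index("date")

# With a custom index, rows can be looked up by their label.
value = indexed.loc["2000-01-02", "A"]

# Go back to the default index; "date" becomes a regular column again.
flat = indexed.reset_index()
```

If you receive a dataframe with a custom index and would rather avoid it, calling `reset_index()` is usually the simplest way to get back to plain columns.
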
### Chaining

Most commands you apply to a dataframe *return* a new object.
That is, by default they *do not* modify the dataframe they're called on.

For example, the following has no effect on `df`.

`df.drop_duplicates()`

Instead, you should update your variable with the returned result.
For example,

`df = df.drop_duplicates()`

This allows you to "chain" multiple operations together. E.g.

`df = df.drop_duplicates().rename(...).some_other_command(...)`

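The behaviour described above can be checked with a short sketch. The data and the particular methods chained here (`drop_duplicates`, `rename`, `sort_values`) are chosen only for illustration.

```python
import pandas as pd

# Hypothetical data containing one duplicate row.
df = pd.DataFrame({"country": ["Canada", "Canada", "Chile"], "mw": [10, 10, 5]})

# The result is discarded, so df itself is unchanged.
df.drop_duplicates()
rows_before = len(df)

# Assign the result back to df, chaining several operations in one statement.
df = (
    df.drop_duplicates()
    .rename(columns={"mw": "capacity_mw"})
    .sort_values("capacity_mw")
)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer chains readable.
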
## Useful commands

- `df = pandas.read_csv(filepath, index_col=False)`. This command
reads a csv file from `filepath` and returns a dataframe that gets stored
in `df`. `index_col=False` ensures that no custom index is automatically
created.

- `df.to_csv(filepath, index=False)`.
This command writes a dataframe to `filepath`. `index=False` means
that the index is not written to the file. This should
be used if you're not using custom indexes since you probably don't
want the default index (just the row numbers) to be written to your csv.

- `df["column_name"]`: Returns a *Series* containing the values for that column.

- `df[["column_1", "column_2"]]`: Returns a *DataFrame* containing only the specified columns.

- `df[df["column_name"] == "some_value"]`: Returns a dataframe with only the rows
where the condition in the square brackets is met. In this case we filter out
all the rows where the value under `column_name` is not `"some_value"`.

- `df.merge(other_df, on=["key_1", "key_2"])`: Merges `df` with `other_df`
where the columns over which we are merging are `key_1` and `key_2`.
By default this is an inner join, so only rows whose keys appear in both dataframes are kept.

- `df.info()`: Prints the columns in the dataframe and some info about each column.

- `df.head()`: Returns the first few rows of the dataframe (5 by default).

- `df.drop_duplicates()`: Returns a dataframe with duplicate rows removed.

- `Series.unique()`: Returns an array of the distinct values in the series.

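Several of the commands above can be tried on a small in-memory table. This is only a sketch: the `plants` dataframe and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data with one duplicated row (P3).
plants = pd.DataFrame(
    {
        "plant": ["P1", "P2", "P3", "P3"],
        "source": ["Wind", "Solar", "Wind", "Wind"],
        "mw": [10, 20, 30, 30],
    }
)

plants = plants.drop_duplicates()          # remove the repeated P3 row
mw = plants["mw"]                          # single column, returned as a Series
subset = plants[["plant", "mw"]]           # two columns, returned as a DataFrame
wind = plants[plants["source"] == "Wind"]  # keep only the rows where source is "Wind"
sources = plants["source"].unique()        # distinct values, in order of appearance
```
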
## Example

This example shows how we can use Pandas to generate a more useful view
of our generation plants from the SWITCH input files.

```python
import pandas as pd

# READ
kwargs = dict(
    index_col=False,
    dtype={"GENERATION_PROJECT": str},  # This ensures that the project id column is read as a string not an int
)
# Note the double asterisk: it unpacks the dict into keyword arguments.
gen_projects = pd.read_csv("generation_projects_info.csv", **kwargs)
costs = pd.read_csv("gen_build_costs.csv", **kwargs)
predetermined = pd.read_csv("gen_build_predetermined.csv", **kwargs)

# JOIN TABLES
gen_projects = gen_projects.merge(
    costs,
    on="GENERATION_PROJECT",
)

gen_projects = gen_projects.merge(
    predetermined,
    on=["GENERATION_PROJECT", "build_year"],
    how="left",  # Makes a left join
)

# FILTER
# When uncommented will filter out all the projects that aren't wind.
# gen_projects = gen_projects[gen_projects["gen_energy_source"] == "Wind"]

# WRITE
gen_projects.to_csv("projects.csv", index=False)
```

If you run the code snippet above in the `inputs` folder, it will create a `projects.csv` file
containing the project data, cost data and prebuild data all in one file.
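
To see what the left join in the example does without needing the input files, here is a small sketch on in-memory tables. The key columns `GENERATION_PROJECT` and `build_year` match the example, but the rows and the `cost` and `predetermined_cap` columns are hypothetical stand-ins, not the real SWITCH column names.

```python
import pandas as pd

# Hypothetical cost table: one row per (project, build year).
costs = pd.DataFrame(
    {
        "GENERATION_PROJECT": ["1", "1", "2"],
        "build_year": [2020, 2030, 2030],
        "cost": [100.0, 90.0, 50.0],  # hypothetical column
    }
)

# Hypothetical prebuild table: only project 1 in 2020 is predetermined.
predetermined = pd.DataFrame(
    {
        "GENERATION_PROJECT": ["1"],
        "build_year": [2020],
        "predetermined_cap": [5.0],  # hypothetical column
    }
)

keys = ["GENERATION_PROJECT", "build_year"]

# Left join: every row of costs is kept; unmatched rows get NaN.
merged = costs.merge(predetermined, on=keys, how="left")

# Default (inner) join: only rows with a match in both tables are kept.
inner = costs.merge(predetermined, on=keys)
```

With `how="left"`, projects that have no predetermined build still appear in the output, with an empty value in the prebuild column, which is presumably why the example uses it.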

docs/Performance.md

Lines changed: 8 additions & 0 deletions
@@ -61,6 +61,14 @@ Current solution* refers to the solution you are trying to find while using the
 - `--save-warm-start` and `--warm-start` both use an extension of the `gurobi_direct` solver interface which is
 generally slower than the `gurobi` solver interface (see section above).

+## Model formulation
+
+The way the model is formulated often has an impact on performance. Here are some rules of thumb.
+
+- For constraints, it is faster to use `<=` and `>=` rather than `==` when possible. If your constraint
+should be an equality, try to think about whether it is already being pushed against one of the bounds
+by the objective function.
+
 ## Tools for improving performance

 - [Memory profiler](https://pypi.org/project/memory-profiler/) for generating plots of the memory
