Update snapshot

15066700 · Samuel Maier · 657cd9fe
Commit 15066700 authored Aug 25, 2023 by Samuel Maier
--- a/common_py/README.md
+++ b/common_py/README.md
@@ -12,4 +12,6 @@ I won't necessarily release all project data (certainly not my tex files), so no
 # Installation
-See Top-Level [README](../README.md)
+See Top-Level [README](../README.md).
\ No newline at end of file
+As of this writing, this directory doesn't contain anything that gets executed on its own; it only provides common things to the other repos.
\ No newline at end of file
--- a/data_explore/README.md
+++ b/data_explore/README.md
-# Tbis directory
+# This directory
-This directory contains a number of standalone files used to generate plots, tables, etc.
+This directory contains a number of standalone python scripts used to generate plots, tables, etc.
+It may not contain all the code that was used.
+In addition, some plots may have been generated with old versions of these scripts.
+In many scripts you have to change hardcoded values to generate specific plots.
 # Installation

--- a/enem_aggregate/README.md
+++ b/enem_aggregate/README.md
@@ -5,12 +5,13 @@ This code does most of the grunt work of aggregating the massive interaction fil
 It does *not* necessarily put the results into the concrete format used by the models. That is done in Python to allow for more model-specific customization in the model repository itself.
 Originally, this code was intended to run in a streaming fashion on any PC.
-This was done in Rust mainly for 3 reasons, unfortunately 2 of them didn't really work out:
+However, the streaming did not work out, so part of it is now done on the cluster.
+Since there was already a lot of code when we gave up, we kept it around, as I cannot spare the time to port it to Python.
+This was done in Rust mainly for 3 reasons; unfortunately, 2 of them didn't really work out either:
 * I wanted to avoid having to learn and use yet another domain-specific language to do efficient data wrangling as required in Python. Unfortunately, Polars makes it difficult to use standard Rust tools, so there is still a lot of that here as well.
 * We noticed that the Python program became difficult to abort when we ran out of system resources. The hope was that this was due to some inter-process communication or something going on with the Python polars API to core polars (which is implemented in rust). If there was indeed an improvement in this regard, it was not noticeable to me.
 * Debugging can be easier in Rust, if you're comfortable with using panics. This is because it embeds the location of each potential panic, and when it happens, displays that location. Polars make debugging difficult due to their lazy and declarative nature. This is the one thing where it worked. Unfortunately, there are still situations where Polars panics internally, which are hard to trace.
-However, this did not work out, so part of it is now done on the cluster.
-Since there was already a lot of code when we gave up we kept it around, asI can not spare the time to port it to Python.
 Because we're working with a lot of data that may or may not be what we expected, this code is written in a very script-like fashion.
@@ -81,7 +82,7 @@ One last thing I've done a lot in this code that people may not be aware of is t
 I will assume familiarity with pattern matching, there are a lot of resources on it, and I've also done it in Python.
 Rust has pattern matching, but in many places, e.g. a `let` pattern binding, it requires that the bound pattern is always true, so matching a vector that is known to the programmer but not to the compiler to contain 2 elements is still impossible.
-The `let-else' bindings have been adopted in Rust from Swift to allow the use of these `fallible' bindings.
+The `let-else` bindings have been adopted in Rust from Swift to allow the use of these `fallible` bindings.
 It takes the form: 
 ```rs
@@ -116,4 +117,8 @@ Unfortunately, these specific features are not very mature yet.
 In general, the `polars` API does not seem to be very stable at this point, and is also very much oriented towards the Python API, which means that it mostly doesn't take advantage of the niceties that Rust's strong type system can provide, while still requiring strong typing in other situations, which can make some things awkward.
 I'm almost certain that I've used a number of APIs in ways that the authors didn't expect, but didn't fail immediately, making it very hard to debug.
 The result is some awkward code and some very annoyed comments, sorry about that.
\ No newline at end of file
+The `polars` Rust API already changed during my time on this project.
+The resulting errors were easy to fix once the replacement was found.
+However, over a longer timeframe this may get more difficult.
\ No newline at end of file
--- a/raw_data/enem/README.md
+++ b/raw_data/enem/README.md
@@ -5,8 +5,11 @@ Please refer to the [official_data README](./official_data/README.md) on how to
 Since the official data only contains the question body in pdf form, which is very difficult for us to use, we asked for and thankfully received a csv dataset elsewhere that had these texts as plain text (with a bit of tex highlighting).
 This format is described in [its README](./plaintext_questions_and_irt/README.md)
-MOST of the annormalities in the source data were fixed in our code, however some select were done so in the source.
+# Official data adjustments and considerations
-TODO: document what was changed, was only like 2 things, that didnt cause hidden errors.
-Applied fixes:
+MOST of the anomalies in the source data were fixed in our code; however, a select few were hard to fix on our end and were fixed in the source instead.
-* CO_ITEM = 58542 has the wrong answer for TX_COR = ROSA. Because I check for this to avoid errors on our end, I fixed that. Youre gonna have to search which year that was in yourself.
\ No newline at end of file
+Applied fixes (sadly there was no time to check if this is complete):
+* CO_ITEM = 58542 has the wrong answer for TX_COR = ROSA. Because I check for this to avoid errors on our end, I fixed that. You're gonna have to search which year that was in yourself.
+Also consider the years we used (see [the specific README](./official_data/README.md#years-aggregated)).
\ No newline at end of file
--- a/raw_data/enem/official_data/README.md
+++ b/raw_data/enem/official_data/README.md
@@ -7,18 +7,29 @@ Simply extracted in this directory, renaming the folder.
 Data is not here purely for size concerns.
 It's about 50 GB.
-## `[YEAR]/DADOS/ITENS_PROVA_[YEAR].csv` (`ITENS_PROVA`)
+## Years aggregated
+The aggregation was done on the years $[2009, 2021] \cap \mathbb{N}$.
+In 2022, questions were repeated in ENEM, so that year provides no new data.
+Files outside of this range may have inconsistencies that were not considered.
+Some of the tests in the [aggregation code](../../../enem_aggregate/) were removed afterwards, and other tests were only done manually.
+So adding new years might cause the aggregation to fail and in any case should be done with care, so as not to falsify the data.
+## Specific source files
+### `[YEAR]/DADOS/ITENS_PROVA_[YEAR].csv` (`ITENS_PROVA`)
 Contains metainfo to the questions
 Uses different separators!
-## `[YEAR]/DADOS/MICRODADOS_ENEM_[YEAR].csv` (`MICRODADOS_ENEM`)
+### `[YEAR]/DADOS/MICRODADOS_ENEM_[YEAR].csv` (`MICRODADOS_ENEM`)
 Contains interactions of students (with all questions of one topic in one entry, eg `NU_NOTA_[TOPIC]`).
 Again inconsistent separators!
-## `[YEAR]/DICIONÁRIO/Dicionário_Microdados_Enem_[YEAR].xlsx`(`dictionary`)
+### `[YEAR]/DICIONÁRIO/Dicionário_Microdados_Enem_[YEAR].xlsx`(`dictionary`)
 Contains further meta info, among others the correlation of the numbers in `MICRODADOS_ENEM/CO_PROVA_[TOPIC]` to the color of the student's exam, which determined the order of the questions.
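
Because the source files use different separators from year to year, a reader may want to guess the separator per file before parsing. The following is only a minimal sketch and not the code used in this repository: the helper names `itens_prova_path` and `detect_separator` are hypothetical, the candidate separators are an assumption, and the heuristic simply counts candidates in the header line. The year range follows the "Years aggregated" section.

```python
from pathlib import Path

# Years covered by the aggregation according to the README: 2009 to 2021 inclusive.
YEARS = range(2009, 2022)


def itens_prova_path(root: Path, year: int) -> Path:
    """Build the path `[YEAR]/DADOS/ITENS_PROVA_[YEAR].csv` below `root` (hypothetical helper)."""
    return root / str(year) / "DADOS" / f"ITENS_PROVA_{year}.csv"


def detect_separator(header_line: str, candidates=(";", ",", "\t")) -> str:
    """Guess the separator as the candidate that occurs most often in the header line.

    This is a simple heuristic and an assumption of this sketch; it ignores quoting.
    """
    counts = {sep: header_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("none of the candidate separators occurs in the header line")
    return best


if __name__ == "__main__":
    root = Path("official_data")  # assumed location of the extracted data
    for year in YEARS:
        path = itens_prova_path(root, year)
        if not path.exists():
            continue  # the data is not shipped with the repository
        # The file encoding is an assumption; adjust it if decoding fails.
        with path.open(encoding="latin-1") as handle:
            print(year, repr(detect_separator(handle.readline())))
```

The detected separator can then be passed to whichever CSV reader is in use. Any year outside the range above should still be checked by hand, as noted in the "Years aggregated" section.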