The problem in brief
I wrote some Python code that I run within a Docker container. On my desktop at work, which runs Kubuntu 18.04, the code executes without complaint. It also used to run cleanly on my laptop, back when the laptop ran the same OS. The laptop now runs NixOS 19.03 (here’s my configuration.nix file), and there the code reports an EOF inside string error.
The configurations should be identical. I edited i18n.extraLocaleSettings in configuration.nix to be identical to what I see when I run locale on the Ubuntu system. I copied the data and ran diff to make sure it copied correctly. The code is running in a Docker container I built on the work machine, uploaded to DockerHub, and downloaded to the home machine.
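Because the code runs inside a container, the host's locale settings may not be what Python actually sees there. A minimal sketch for comparing the two machines, using only the standard library, is to print the encoding-related settings from inside the container on each one. The helper name `encoding_report` is made up for illustration, and this is a diagnostic suggestion rather than a known fix:

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encoding-related settings visible to this Python process."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

If the two reports match, the locale is probably not what distinguishes the two runs.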
A possibly interesting non-solution
If I import the csv library and add the option quoting=csv.QUOTE_NONE to the pandas.read_csv call that complains, it stops complaining. But it still doesn’t read the file correctly. It does okay for a good number of rows, but when it has to parse the string “Bogotá, D.C.” it gets confused:
... # I'm skipping hundreds of lines
97161.0,CARURU,97.0,VAUPÉS # good
97666.0,TARAIRA,97.0,VAUPÉS # good
99001.0,PUERTO CARREÑO,99.0,VICHADA # good
99524.0,LA PRIMAVERA,99.0,VICHADA # good
99624,SANTA ROSALÍA,99,VICHADA # good
99773.0,CUMARIBO,99.0,VICHADA # good
" D.C.""",11001,11,"""BOGOTÁ" # ridiculous
" PROVIDENCIA Y SANTA CATALINA""",88564,88,"""ARCHIPIÉLAGO DE SAN ANDR
ÉS"
" SOBRETASA CIGARRILLOS""",19197953,"""OTRAS RENTAS CEDIDAS SALUD. IVA
", JUEGOS DE SUERTE Y AZAR
"""ARCHIPIÉLAGO DE SAN ANDRÉS"," PROVIDENCIA Y SANTA CATALINA"""," PRO
VIDENCIA Y SANTA CATALINA""",88
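The scrambled rows look consistent with what quoting=csv.QUOTE_NONE asks for: with quoting disabled, a quote character is ordinary text, so the comma inside “Bogotá, D.C.” becomes a field delimiter. The sketch below uses the standard csv module rather than pandas, and a sample row invented for illustration (not a line from the real file), to show the field count changing:

```python
import csv
import io

# Hypothetical sample row, invented for illustration:
# the second field is quoted and contains a comma.
sample = '11001,"BOGOTÁ, D.C.",11,BOGOTÁ\n'

# Default quoting: the quoted field is kept whole.
default_row = next(csv.reader(io.StringIO(sample)))

# QUOTE_NONE: quotes are ordinary characters, so the inner comma splits the field.
unquoted_row = next(csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONE))

print(len(default_row), default_row)
print(len(unquoted_row), unquoted_row)
```

The second row has one more field than the first, and the stray quote characters stay attached to the two halves of the name.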
The error in detail
On my laptop, the program reads the bulk of the files with no problem:
(base) root@127:/mnt# make all subsample=1
PYTHONPATH='.' python3 Code/build/make_keys.py
data/sisfut/original_csv/2012_ingresos.csv
data/sisfut/original_csv/2013_ingresos.csv
data/sisfut/original_csv/2014_ingresos.csv
data/sisfut/original_csv/2015_ingresos.csv
data/sisfut/original_csv/2016_ingresos.csv
data/sisfut/original_csv/2017_ingresos.csv
data/sisfut/original_csv/2018_ingresos.csv
data/sisfut/original_csv/2012_inversion.csv
data/sisfut/original_csv/2013_inversion.csv
data/sisfut/original_csv/2014_inversion.csv
data/sisfut/original_csv/2015_inversion.csv
data/sisfut/original_csv/2016_inversion.csv
(it prints the name of each file before trying to read it), but then reports an EOF inside string error:
data/sisfut/original_csv/2017_inversion.csv
Traceback (most recent call last):
  File "Code/build/make_keys.py", line 23, in <module>
    "Concepto"
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 463, in _read
    data = parser.read(nrows)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
    ret = self._engine.read(nrows)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 92182
Makefile:126: recipe for target 'output/keys/budget.csv' failed
make: *** [output/keys/budget.csv] Error 1
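The message means the tokenizer reached the end of the file while still inside a quoted field that it believes opened at row 92182. One rough way to look for a quote that was opened but never closed is to list the lines containing an odd number of double-quote characters. This is only a heuristic sketch (a field that legitimately spans several lines trips it too), and `odd_quote_lines` is a hypothetical helper, not part of pandas:

```python
def odd_quote_lines(lines, limit=10):
    """Return (line_number, text) for the first `limit` lines with an odd count of '"'."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        if line.count('"') % 2 == 1:
            hits.append((line_number, line.rstrip("\r\n")))
            if len(hits) >= limit:
                break
    return hits


# In-memory demonstration with made-up data:
# the third line opens a quote that is never closed.
demo = [
    'code,name\n',
    '1,"BOGOTÁ, D.C."\n',
    '2,"unterminated\n',
    '3,plain\n',
]
demo_hits = odd_quote_lines(demo)
print(demo_hits)
```

Against the real file, the function could be given an open file handle instead of a list, for example `open("data/sisfut/original_csv/2017_inversion.csv", encoding="utf-8", newline="")`; the utf-8 encoding there is an assumption.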
If I print the offending line and its two neighbors to the screen, I see nothing wrong (the first line of the file contains column names, which is why I use head -n 92184 when the error is at row 92182):
(base) root@127:/mnt# head -n 92184 data/sisfut/original_csv/2017_inversion.csv | tail -n 3
214819548,PIENDAMÓ,19,CAUCA,19548,PIENDAMÓ,A.14.1.5,PROGRAMA DE ATENCION INTEGRAL A LA PRIMERA INFANCIA,100.0,"INGRESOS CORRIENTES DE LIBRE DESTINACION EXCEPTO EL 42% DE LIBRE DESTINACIÓN DE PROPOSITO GENERAL DE MPIOS DE CATEGORIA 4, 5 Y 6",10000000,10000000,0,0,0
212441524,PALERMO,41,HUILA,41524,PALERMO,A.14.1.5,PROGRAMA DE ATENCION INTEGRAL A LA PRIMERA INFANCIA,100.0,"INGRESOS CORRIENTES DE LIBRE DESTINACION EXCEPTO EL 42% DE LIBRE DESTINACIÓN DE PROPOSITO GENERAL DE MPIOS DE CATEGORIA 4, 5 Y 6",15000000,15000000,13974800,13974800,13974800
211541615,RIVERA,41,HUILA,41615,RIVERA,A.14.1.5,PROGRAMA DE ATENCION INTEGRAL A LA PRIMERA INFANCIA,100.0,"INGRESOS CORRIENTES DE LIBRE DESTINACION EXCEPTO EL 42% DE LIBRE DESTINACIÓN DE PROPOSITO GENERAL DE MPIOS DE CATEGORIA 4, 5 Y 6",0,3600000,3566666,3566666,3566666
(base) root@127:/mnt#
Then again, maybe the row number in the error doesn’t correspond to that line of the file, in which case the problem lies elsewhere.
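The row number a CSV parser reports and the line number head counts can differ: a quoted field may contain a line break, in which case one record occupies several physical lines. The standard csv module exposes this through the reader's line_num attribute. The sketch below, with a hypothetical helper `record_start_lines` and made-up data, maps each record to the physical line on which it starts:

```python
import csv


def record_start_lines(lines):
    """Return, for each CSV record, the 1-based physical line where it begins."""
    reader = csv.reader(lines)
    starts = []
    previous_end = 0
    for _ in reader:
        # The record just read began on the line after the previous record ended.
        starts.append(previous_end + 1)
        previous_end = reader.line_num
    return starts


# Made-up data: the second record contains a line break inside a quoted field,
# so it spans physical lines 2 and 3.
demo_lines = [
    'id,name\n',
    '1,"two\n',
    'lines"\n',
    '2,plain\n',
]
starts = record_start_lines(demo_lines)
print(starts)
```

Here the third record starts on the fourth line, so counting records and counting lines give different answers from that point on. Whether pandas numbers rows in exactly this way is an assumption to be checked.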
Here’s the Python code I’m running:
import os
from itertools import chain
import pandas as pd
import Code.metadata.four_series as sm
source_data = pd.DataFrame()
for series in sm.series:
    for year in range( 2012, 2018+1 ):
        filename = ( sm.source_folder + "original_csv/"
                     + str(year) + "_" + series + ".csv" )
        print(filename)
        shuttle = pd.read_csv(
            filename,
            usecols = [
                "Cód. DANE Municipio",
                "Nombre DANE Municipio",
                "Cód. DANE Departamento",
                "Nombre DANE Departamento",
                "Código Concepto",
                "Concepto"
            ] )
        source_data = source_data.append( shuttle )