Python in Docker in NixOS can't read something that it can in the same Docker container in Ubuntu

The problem in brief

I wrote some Python code that I run within a Docker container. On my desktop at work, which runs Kubuntu 18.04, the code executes without complaint. It used to run on my laptop too, back when the laptop ran the same OS. Now the laptop runs NixOS 19.03 (here’s my configuration.nix file) and the code reports an "EOF inside string" error.

The configurations should be identical. I edited i18n.extraLocaleSettings in configuration.nix to match the output of locale on the Ubuntu system. I copied the data and ran diff to make sure it copied correctly. The code is running in a Docker container I built on the work machine, uploaded to DockerHub, and downloaded to the home machine.
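When the two copies of the data live on different machines, diff cannot compare them directly. One way to compare them is to compute a checksum for each file on each machine and compare the resulting tables. The following is a minimal sketch of that idea; the helper names sha256_of and checksum_table are made up for illustration, and the folder to pass in is whatever holds the CSV files.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_table(folder, pattern="*.csv"):
    """Map each matching file name under `folder` to its SHA-256 digest."""
    return {p.name: sha256_of(p) for p in sorted(Path(folder).glob(pattern))}
```

Running checksum_table on the same folder on both machines and comparing the two dictionaries would show which files, if any, differ byte for byte.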

A possibly interesting non-solution

If in the code I import the csv library and then add the option quoting=csv.QUOTE_NONE to the pandas.read_csv command that complains, it stops complaining. But it still doesn’t read the file correctly. It does okay for a good number of rows, but when it has to parse the string “Bogotá, D.C.” it gets confused:

...                                 # I'm skipping hundreds of lines
97161.0,CARURU,97.0,VAUPÉS          # good
97666.0,TARAIRA,97.0,VAUPÉS         # good
99001.0,PUERTO CARREÑO,99.0,VICHADA # good
99524.0,LA PRIMAVERA,99.0,VICHADA   # good
99624,SANTA ROSALÍA,99,VICHADA      # good
99773.0,CUMARIBO,99.0,VICHADA       # good
" D.C.""",11001,11,"""BOGOTÁ"       # ridiculous
" PROVIDENCIA Y SANTA CATALINA""",88564,88,"""ARCHIPIÉLAGO DE SAN ANDR
ÉS"             
" SOBRETASA CIGARRILLOS""",19197953,"""OTRAS RENTAS CEDIDAS SALUD. IVA
", JUEGOS DE SUERTE Y AZAR
"""ARCHIPIÉLAGO DE SAN ANDRÉS"," PROVIDENCIA Y SANTA CATALINA"""," PRO
VIDENCIA Y SANTA CATALINA""",88
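The mangled rows are consistent with what QUOTE_NONE means: quote characters stop being special, so a comma inside a quoted field splits the field in two and the quote marks stay in the data. Here is a small sketch using the standard library's csv reader rather than pandas, to keep it self-contained; the sample row is made up in the shape of the data above.

```python
import csv
import io

# Made-up sample row: one quoted field that contains a comma.
SAMPLE = '11001,"BOGOTÁ, D.C.",11\n'


def parse(text, **kwargs):
    """Parse CSV text with the stdlib reader and return the rows as lists."""
    return list(csv.reader(io.StringIO(text), **kwargs))


default_rows = parse(SAMPLE)
unquoted_rows = parse(SAMPLE, quoting=csv.QUOTE_NONE)

print(default_rows)   # one row with three fields
print(unquoted_rows)  # one row with four fields, quote characters kept
```

With QUOTE_NONE the row yields the fragments `"BOGOTÁ` and ` D.C."`, which, once written back out with quotes escaped, look like the `"""BOGOTÁ"` and `" D.C."""` in the listing above. So the option hides the error rather than fixing the parse.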

The error in detail

On my laptop, the program reads the bulk of the files with no problem:

(base) root@127:/mnt# make all subsample=1
PYTHONPATH='.' python3 Code/build/make_keys.py
data/sisfut/original_csv/2012_ingresos.csv
data/sisfut/original_csv/2013_ingresos.csv
data/sisfut/original_csv/2014_ingresos.csv
data/sisfut/original_csv/2015_ingresos.csv
data/sisfut/original_csv/2016_ingresos.csv
data/sisfut/original_csv/2017_ingresos.csv
data/sisfut/original_csv/2018_ingresos.csv
data/sisfut/original_csv/2012_inversion.csv
data/sisfut/original_csv/2013_inversion.csv
data/sisfut/original_csv/2014_inversion.csv
data/sisfut/original_csv/2015_inversion.csv
data/sisfut/original_csv/2016_inversion.csv

(it prints the name of each file before trying to read it),
but then reports an EOF inside string error:

data/sisfut/original_csv/2017_inversion.csv
Traceback (most recent call last):
  File "Code/build/make_keys.py", line 23, in <module>
    "Concepto"

  File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 463, in _read
    data = parser.read(nrows)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
    ret = self._engine.read(nrows)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 92182
Makefile:126: recipe for target 'output/keys/budget.csv' failed
make: *** [output/keys/budget.csv] Error 1

If I print the offending line and its two neighbors to the screen, I see nothing wrong (the first line of the file contains column names, which is why I use head -n 92184 when the error is at line 92182):

(base) root@127:/mnt# head -n 92184 data/sisfut/original_csv/2017_inversion.csv | tail -n 3
214819548,PIENDAMÓ,19,CAUCA,19548,PIENDAMÓ,A.14.1.5,PROGRAMA DE ATENCION INTEGRAL A LA PRIMERA INFANCIA,100.0,"INGRESOS CORRIENTES DE LIBRE DESTINACION EXCEPTO EL 42% DE LIBRE DESTINACIÓN DE PROPOSITO GENERAL DE MPIOS DE CATEGORIA 4, 5 Y 6",10000000,10000000,0,0,0
212441524,PALERMO,41,HUILA,41524,PALERMO,A.14.1.5,PROGRAMA DE ATENCION INTEGRAL A LA PRIMERA INFANCIA,100.0,"INGRESOS CORRIENTES DE LIBRE DESTINACION EXCEPTO EL 42% DE LIBRE DESTINACIÓN DE PROPOSITO GENERAL DE MPIOS DE CATEGORIA 4, 5 Y 6",15000000,15000000,13974800,13974800,13974800
211541615,RIVERA,41,HUILA,41615,RIVERA,A.14.1.5,PROGRAMA DE ATENCION INTEGRAL A LA PRIMERA INFANCIA,100.0,"INGRESOS CORRIENTES DE LIBRE DESTINACION EXCEPTO EL 42% DE LIBRE DESTINACIÓN DE PROPOSITO GENERAL DE MPIOS DE CATEGORIA 4, 5 Y 6",0,3600000,3566666,3566666,3566666
(base) root@127:/mnt# 

Then again maybe it’s not counting lines correctly, so the error lies elsewhere.
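Since "EOF inside string" suggests a quote that opens and never closes, one way to look for the real culprit without trusting the reported row number is to scan for lines holding an odd number of double quotes. This is a rough heuristic sketch, not a CSV parser: a legitimately quoted field that spans several lines would also be flagged. The function names are made up for illustration.

```python
def lines_with_odd_quotes(lines, quotechar='"'):
    """Return 1-based numbers of lines holding an odd count of `quotechar`."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count(quotechar) % 2 == 1
    ]


def scan_file(path, encoding="utf-8"):
    """Apply lines_with_odd_quotes to a file, without translating newlines."""
    with open(path, encoding=encoding, newline="") as f:
        return lines_with_odd_quotes(f)
```

Calling scan_file on data/sisfut/original_csv/2017_inversion.csv would list candidate lines to inspect by hand. Escaped quotes written as two consecutive quote characters count as a pair, so they do not trigger a false positive.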

Here’s the python code I’m running:

import os
from itertools import chain
import pandas as pd
import Code.metadata.four_series as sm

# Read the key columns from every (series, year) file, then combine them.
frames = []
for series in sm.series:
  for year in range( 2012, 2018+1 ):
    filename = ( sm.source_folder + "original_csv/"
                 + str(year) + "_" + series + ".csv" )
    print(filename)  # show which file is about to be read
    shuttle = pd.read_csv(
      filename,
      usecols = [
        "Cód. DANE Municipio",
        "Nombre DANE Municipio",
        "Cód. DANE Departamento",
        "Nombre DANE Departamento",
        "Código Concepto",
        "Concepto"
      ] )
    frames.append( shuttle )
# pd.concat replaces repeated DataFrame.append, which pandas 2.0 removed.
source_data = pd.concat( frames )

pandas.read_csv has an optional encoding argument. Have you tried setting it to the encoding of your file?

When I run file on the files to read, they are all reported as either UTF-8 Unicode text or UTF-8 Unicode text, with very long lines. UTF-8 is, I believe, the default encoding pandas uses. However, if I add encoding="UTF-8", it behaves no differently.

(Well, almost no differently. The for loop loops over a difference in two sets. For some reason changing the encoding option changes the order in which it processes the resulting set – which is weird to me, but doesn’t seem to bear on the read problem.)
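The UTF-8 report from file can be double-checked by decoding every byte of the file, in case file classified it from only part of its contents. A minimal sketch, with a made-up function name:

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that failed to decode.
        return err.start
    return None
```

Passing it the raw bytes of a file, for example from open(filename, "rb").read(), returns None when the whole file is valid UTF-8 and otherwise the position where decoding first fails.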

I think these differences in how file sees that file are clues to the source of the problem. Do you use exactly the same file in all of the scenarios? I mean, do you use Docker volumes or “mounts” of any kind?

I have a wild guess that you are having a discrepancy with new lines (\r vs \n) in your file. This may happen when you edit it with different editors / with different settings.

Besides that, in my experience, handling the locale the way you do is not that simple. I once saw a program behave differently because I had put quotes around a locale value when I shouldn’t have. What confused me was that locale printed the quotes too.
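The line-ending guess above can be checked by counting each style of line ending in the raw bytes of the file on both machines. A minimal sketch, with a made-up function name:

```python
def count_line_endings(data):
    """Count CRLF, bare LF, and bare CR line endings in a bytes object."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        # Every CRLF also contains one LF and one CR, so subtract those.
        "lf": data.count(b"\n") - crlf,
        "cr": data.count(b"\r") - crlf,
    }
```

If the counts differ between the two copies of the same file, or one file mixes styles, an editor or transfer step probably rewrote the line endings.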

It was a data corruption issue. How embarrassing. Thanks everyone, and sorry for the false alarm.