Coding challenge: Get useful data from a 1-billion-row text file with Python

Recently, I took part in a coding challenge organized by the very awesome Moroccan tech community (geeksblabla) during their annual online tech conference. Here is the link to the coding challenge.

What is the problem in a nutshell?

You have a 1-billion-row text file that looks like this:

casa,tomato,6.23
casa,tomato,7.23
casa,tomato,8.23
casa,tomato,9.23
casa,potato,4.21
casa,flour,6.24
casa,oil,9.24
casa,oil,9.94
casa,oil,8.24

The first value is the city name, the second is the product name, and the third is the product price. You have to output the name of the cheapest city and the total price of its products, separated by a single space. On the following lines, you have to output the 5 cheapest products in that city. See the README for examples.
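
To make the format concrete, here is the output for the small sample above, using the formatting produced by the solution below (the README remains the authoritative reference; the sample has a single city and only four distinct products, so fewer than five product lines appear):

casa 68.79
potato 4.21
tomato 6.23
flour 6.24
oil 8.24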

Python solution

from collections import defaultdict


def read_large_file(file_path, chunk_size=1024 * 1024):
    """Yield non-empty lines from the input file, reading at most chunk_size characters at a time"""

    with open(file_path, "r", encoding="utf-8") as file:
        while True:
            # readline(chunk_size) returns one line per call; lines in this
            # file are short, so the size cap never splits a line
            chunk = file.readline(chunk_size)
            if not chunk:
                break
            line = chunk.strip()
            if line:
                yield line


def format_float_number(number):
    """Format a float with 2 decimal places"""

    return "{:.2f}".format(number)


def main(input_path, output_path):
    """Main function"""

    city_totals = defaultdict(float)
    city_products = defaultdict(lambda: defaultdict(float))

    for line in read_large_file(input_path):
        city, product, price = line.split(",")
        price = float(price)
        # Accumulate the total spent in each city
        city_totals[city] += price
        # Keep the lowest price seen so far for each product in the city
        # (a missing entry defaults to 0.0, which "not" treats as unset)
        if not city_products[city][product] or price < city_products[city][product]:
            city_products[city][product] = price

    # The cheapest city is the one with the smallest total
    cheapest_city = min(city_totals, key=city_totals.get)
    # Sort its products by price, then by name, and keep the first five
    cheapest_products = sorted(
        city_products[cheapest_city].items(), key=lambda x: (x[1], x[0])
    )[:5]

    output_lines = [f"{cheapest_city} {format_float_number(city_totals[cheapest_city])}"]
    output_lines.extend(
        f"{product} {format_float_number(price)}" for product, price in cheapest_products
    )

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(output_lines) + "\n")


if __name__ == "__main__":

    INPUT_FILE_PATH = "input.txt"
    OUTPUT_FILE_PATH = "output.txt"
    main(INPUT_FILE_PATH, OUTPUT_FILE_PATH)
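
To try the script locally, you first need an input file in the same city,product,price format. Here is a rough sketch of a generator for a synthetic file; it is my own illustration, not part of the challenge, and the city/product lists and row count are arbitrary placeholders.

import random


def generate_input(path, rows):
    """Write `rows` random city,product,price lines to `path`"""

    # Arbitrary placeholder values, not the real challenge data
    cities = ["casa", "rabat", "fes", "tangier"]
    products = ["tomato", "potato", "flour", "oil"]
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(rows):
            city = random.choice(cities)
            product = random.choice(products)
            price = random.uniform(1, 100)
            f.write(f"{city},{product},{price:.2f}\n")


if __name__ == "__main__":
    # 1 million rows is enough for a quick sanity check;
    # the real challenge file has 1 billion
    generate_input("input.txt", 1_000_000)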

The input file is input.txt and the output file is output.txt. When running this script with Python 3.9 on a 1-billion-row file of approximately 22 GB, it successfully produced the intended output. Execution took around 780 seconds (≈ 13.1 minutes). It may be faster on more recent Python versions. Check the repo to see other solutions.
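
For reference, this is roughly how the run time can be measured; the timing wrapper is my own illustration and assumes the script above is saved as solution.py.

import time

from solution import main  # assumes the solution above is saved as solution.py

if __name__ == "__main__":
    start = time.perf_counter()
    main("input.txt", "output.txt")
    elapsed = time.perf_counter() - start
    print(f"Finished in {elapsed:.1f} seconds ({elapsed / 60:.1f} minutes)")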