Coding challenge: Get useful data from a 1b rows text file with Python
- Author: Ismail Tlemcani (@Ismailtlem)
I recently took part in a coding challenge organized by Geeksblabla, the awesome Moroccan tech community, as part of their annual online tech conference. Here is the link to the coding challenge.
What is the problem in a nutshell?
You have a 1 billion row text file that looks like this:
casa,tomato,6.23
casa,tomato,7.23
casa,tomato,8.23
casa,tomato,9.23
casa,potato,4.21
casa,flour,6.24
casa,oil,9.24
casa,oil,9.94
casa,oil,8.24
Each row holds three comma-separated values: the name of the city, the name of the product, and the price of the product. The first line of the output must contain the name of the cheapest city (the one with the lowest total price) and the total price of the products in that city, separated by a single space. The following lines must list the 5 cheapest products in that city. You can see the README for examples.
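To make the expected output concrete, here is a minimal sketch that applies these rules to the nine sample rows above. The `summarize` helper and the `SAMPLE` constant are names made up for this illustration. It assumes that a product's price within a city is its lowest listed price and that ties are broken alphabetically; the README remains the authority on the exact rules.

```python
from collections import defaultdict

# The nine sample rows shown above
SAMPLE = """casa,tomato,6.23
casa,tomato,7.23
casa,tomato,8.23
casa,tomato,9.23
casa,potato,4.21
casa,flour,6.24
casa,oil,9.24
casa,oil,9.94
casa,oil,8.24
"""


def summarize(text, top_n=5):
    """Return the output lines for the given rows (hypothetical helper)."""
    totals = defaultdict(float)  # city -> sum of all prices
    lowest = defaultdict(dict)   # city -> {product: lowest price seen}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        city, product, price = line.split(",")
        price = float(price)
        totals[city] += price
        if product not in lowest[city] or price < lowest[city][product]:
            lowest[city][product] = price

    # Cheapest city = lowest total; products sorted by price, then by name
    cheapest_city = min(totals, key=totals.get)
    ranked = sorted(lowest[cheapest_city].items(), key=lambda item: (item[1], item[0]))[:top_n]

    lines = [f"{cheapest_city} {totals[cheapest_city]:.2f}"]
    lines += [f"{product} {price:.2f}" for product, price in ranked]
    return lines


print("\n".join(summarize(SAMPLE)))
# casa 68.79
# potato 4.21
# tomato 6.23
# flour 6.24
# oil 8.24
```

The sample contains a single city and only four distinct products, so fewer than five product lines are printed.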
Python solution
from collections import defaultdict


def read_large_file(file_path, chunk_size=1024 * 1024):
    """Yield the non-empty lines of the input file one at a time"""
    with open(file_path, "r", encoding="utf-8") as file:
        while True:
            # readline(chunk_size) returns at most chunk_size characters of the next line
            chunk = file.readline(chunk_size)
            if not chunk:  # an empty string means end of file
                break
            line = chunk.strip()
            if line:
                yield line


def format_float_number(number):
    """Format a float number with 2 decimals"""
    return "{:.2f}".format(number)


def main(input_path, output_path):
    """Find the cheapest city and its 5 cheapest products"""
    city_totals = defaultdict(float)  # city -> sum of all prices
    city_products = defaultdict(dict)  # city -> {product: lowest price seen}

    for line in read_large_file(input_path):
        city, product, price = line.split(",")
        price = float(price)
        city_totals[city] += price
        products = city_products[city]
        # Membership test, so a stored price of 0.0 is not mistaken for "not seen yet"
        if product not in products or price < products[product]:
            products[product] = price

    cheapest_city = min(city_totals, key=city_totals.get)
    # Sort by price, then by product name to break ties
    cheapest_products = sorted(city_products[cheapest_city].items(), key=lambda x: (x[1], x[0]))[:5]

    output_lines = [f"{cheapest_city} {format_float_number(city_totals[cheapest_city])}"]
    output_lines.extend([f"{product} {format_float_number(price)}" for product, price in cheapest_products])

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(output_lines) + "\n")


if __name__ == "__main__":
    INPUT_FILE_PATH = "input.txt"
    OUTPUT_FILE_PATH = "output.txt"
    main(INPUT_FILE_PATH, OUTPUT_FILE_PATH)
The input file is input.txt and the output file is output.txt. When run with Python 3.9 on a 1 billion row file of approximately 22 GB, the script successfully produced the intended output. The execution time was around 780 seconds (about 13 minutes). It may run faster on more recent Python versions. Check the repo to see other solutions.
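For anyone who wants to repeat the measurement on another Python version, a timing harness along these lines could be used. This is only a sketch: `timed` is a hypothetical helper, not part of the original script, and the workload below is a stand-in so that the snippet runs on its own.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; with the script above this could instead be
# timed(main, "input.txt", "output.txt")
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total} computed in {seconds:.3f} s")
```

At the figures reported above, 1 billion rows in about 780 seconds works out to roughly 1.3 million rows per second.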