Parsing Large Log Files

Parsing Logs

The following examples use a small, 150-kilobyte Traefik access.log.
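If you want to reproduce the measurements, a sketch along these lines can generate a comparable file. The field values are made up for illustration, and the layout only approximates Traefik's default CLF-style access log, so the exact format may differ:

import random

size = 0
with open("access.log", "w") as f:
    # Keep appending synthetic log lines until the file is roughly 150 KB.
    while size < 150_000:
        ip = f"10.0.0.{random.randint(1, 254)}"
        line = (
            f'{ip} - - [01/Jan/2024:00:00:00 +0000] '
            f'"GET / HTTP/1.1" 200 {random.randint(100, 5000)} '
            f'"-" "-" {random.randint(1, 100_000)} "web@docker" '
            f'"http://10.0.0.2:80" {random.randint(1, 50)}ms\n'
        )
        f.write(line)
        size += len(line)  # lines are ASCII, so len() equals the byte count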

Bad

import tracemalloc
 
tracemalloc.start()
 
with open("access.log", "r") as f:
    # readlines() reads the entire file into a list of strings up front
    for line in f.readlines():
        print(line.split()[0])
 
# prints (current, peak) traced memory usage in bytes
print(tracemalloc.get_traced_memory())
tracemalloc.stop()

The peak memory consumption was 238.501 kilobytes.

It is important not to load the entire file into memory with f.readlines(). Iterating over the file object instead reads one buffered line at a time, which keeps memory consumption down.

Good

import tracemalloc
 
tracemalloc.start()
 
with open("access.log", "r") as f:
    # iterating the file object yields one line at a time from the buffer
    for line in f:
        print(line.split()[0])
 
# prints (current, peak) traced memory usage in bytes
print(tracemalloc.get_traced_memory())
tracemalloc.stop()

The peak memory consumption was 22.762 kilobytes, roughly 90% less than with f.readlines().

Reading from the buffer keeps memory consumption low. In this case, the runtime of the two programs was nearly identical (about 0.025 seconds each). With larger log files, however, the readlines() approach can also hurt performance: if the file no longer fits in RAM, the operating system falls back to much slower swap memory.
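To check the timing on your own machine, a minimal sketch like the following works. The function names are made up here, and the print from the examples above is dropped so that terminal I/O does not dominate the measurement:

import time

def parse_readlines(path):
    # Loads the whole file into a list, then iterates it.
    with open(path, "r") as f:
        for line in f.readlines():
            line.split()

def parse_streaming(path):
    # Iterates the file object, pulling one buffered line at a time.
    with open(path, "r") as f:
        for line in f:
            line.split()

for fn in (parse_readlines, parse_streaming):
    start = time.perf_counter()
    fn("access.log")
    print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s")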