这是indexloc提供的服务,不要输入任何密码
Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
110 commits
Select commit Hold shift + click to select a range
5c99143
script updates to reflect AWS SDK changes
LeonhardFS Dec 7, 2021
4f8fbb6
fix spark link
Dec 8, 2021
c3de456
fixing python package versions
Dec 8, 2021
ee866fa
fix numpy to 1.15.4
LeonhardFS Dec 8, 2021
e67f849
repro effort
LeonhardFS Dec 8, 2021
5bf274a
readme update
LeonhardFS Dec 8, 2021
5a2b2e7
AWS setup instructions
LeonhardFS Dec 8, 2021
142202e
more description text
LeonhardFS Dec 9, 2021
9ddef92
plotting updated, fig3 done
LeonhardFS Dec 9, 2021
aba9167
fig3, fig7, table3 done
LeonhardFS Dec 9, 2021
c51d19a
figure 4 done
LeonhardFS Dec 9, 2021
a91d4dc
figure 8
LeonhardFS Dec 9, 2021
797befd
added figure9
LeonhardFS Dec 9, 2021
80cfd6b
added figure6
LeonhardFS Dec 9, 2021
34578ad
results
LeonhardFS Dec 9, 2021
67f2348
cleanup, plotting now fully working
LeonhardFS Dec 9, 2021
a617955
build script
LeonhardFS Dec 9, 2021
aede534
using NUM_RUNS env variable if available
LeonhardFS Dec 9, 2021
b5629d1
update boost link
LeonhardFS Dec 9, 2021
e55cf2f
fix
LeonhardFS Dec 9, 2021
92b32b9
new sbt install script
LeonhardFS Dec 9, 2021
b5cb722
fix
LeonhardFS Dec 9, 2021
b0cb954
missing libmagic added
LeonhardFS Dec 9, 2021
e2e156b
updated dependencies
LeonhardFS Dec 9, 2021
5a63493
update build script
LeonhardFS Dec 9, 2021
4a6df1d
compile fix
LeonhardFS Dec 9, 2021
dc2abee
fix
LeonhardFS Dec 9, 2021
16ebc5d
make boto3 optional
LeonhardFS Dec 9, 2021
3960b74
Bug fix for file output
bgivertz Dec 10, 2021
1961371
Bug fix for file output
bgivertz Dec 10, 2021
8f1ba2a
force add files
LeonhardFS Dec 10, 2021
335d025
removed debug print, why is this on master?
LeonhardFS Dec 10, 2021
9bef5a5
trying to normalize pattern
LeonhardFS Dec 10, 2021
52f2f70
removing tuplex output dirs becuse of the output validation -.-
LeonhardFS Dec 10, 2021
a59c4ba
add logs
LeonhardFS Dec 10, 2021
3403b17
more output validation challenges
LeonhardFS Dec 10, 2021
43cb9b7
deactivating validation of output specification because it's buggy
LeonhardFS Dec 10, 2021
9e9c42c
updated help message in script
LeonhardFS Dec 10, 2021
87c35d1
container management
LeonhardFS Dec 10, 2021
11ceb99
changed uniqueFileName func
LeonhardFS Dec 10, 2021
cfe3f03
start/stop commands added
LeonhardFS Dec 10, 2021
8e891f0
added run commands
LeonhardFS Dec 10, 2021
6280a58
run
LeonhardFS Dec 10, 2021
44b0969
refactored
LeonhardFS Dec 10, 2021
2bd820e
flag fix
LeonhardFS Dec 10, 2021
e1a2274
detach container automaticallhy
LeonhardFS Dec 10, 2021
43797e0
typo fix
LeonhardFS Dec 10, 2021
e988ceb
fix
LeonhardFS Dec 10, 2021
0713716
speculative container removal
LeonhardFS Dec 10, 2021
7d0fe78
adding missing quotes
LeonhardFS Dec 10, 2021
50c7648
lowercase
LeonhardFS Dec 10, 2021
9e9f675
more printing
LeonhardFS Dec 10, 2021
0b4f797
fix cmd
LeonhardFS Dec 10, 2021
73c7976
debug print
LeonhardFS Dec 10, 2021
2eebb5a
start container
LeonhardFS Dec 10, 2021
bfc3df4
new data download command
LeonhardFS Dec 10, 2021
6962951
link update
LeonhardFS Dec 10, 2021
d63564e
updated README
LeonhardFS Dec 10, 2021
c091188
update
LeonhardFS Dec 10, 2021
41e7942
hanging 7z
LeonhardFS Dec 10, 2021
e17647b
overwrite
LeonhardFS Dec 10, 2021
b9084dd
fix
LeonhardFS Dec 10, 2021
f314457
updating reqs
LeonhardFS Dec 10, 2021
6125b64
docker exec fix
LeonhardFS Dec 10, 2021
3c7e4fd
decode fix
LeonhardFS Dec 10, 2021
231acf7
Merge branch 'master' into sigmod-repro
LeonhardFS Dec 13, 2021
e60816c
add missing path conversion
LeonhardFS Dec 13, 2021
f297917
deactivating output validation, too buggy
LeonhardFS Dec 13, 2021
9b0b609
Trigger CI
bgivertz Dec 13, 2021
b4732aa
added changes in
LeonhardFS Dec 14, 2021
a83d47b
deactivating tests to validate output file specifiation
LeonhardFS Dec 14, 2021
f22881e
fixing root-level script
LeonhardFS Jan 24, 2022
ffd6ae5
force adding sbt plugin required for scala pipeline
LeonhardFS Jan 24, 2022
aa9ccee
also force added for Z2
LeonhardFS Jan 24, 2022
782c58b
correcting dask for pypy
LeonhardFS Jan 24, 2022
0977141
adjusted logs to avoid disk spill
LeonhardFS Jan 25, 2022
9fb94d5
adding path to env, adding file check
LeonhardFS Jan 25, 2022
bcc1e3b
fix
LeonhardFS Jan 25, 2022
3f13870
tuplex check
LeonhardFS Jan 25, 2022
3843ccc
plot script fixes
LeonhardFS Feb 7, 2022
5fa7cb6
dask update
LeonhardFS Feb 7, 2022
4ce36e4
dask for zillow fix
LeonhardFS Feb 8, 2022
662e123
give dask more memory because it crashes
LeonhardFS Feb 8, 2022
d69b2b9
update preprocess scripts and run script
LeonhardFS Feb 8, 2022
f868c4f
updating docker and patching weld
LeonhardFS Feb 8, 2022
06847d5
pip fix
LeonhardFS Feb 8, 2022
5a1863e
fixing weld
LeonhardFS Feb 8, 2022
64121ef
script fix
LeonhardFS Feb 8, 2022
0f06eee
add weld
LeonhardFS Feb 8, 2022
8ba6a34
hyper fix
LeonhardFS Feb 8, 2022
d71eeef
new template for flights run, avoid overwriting of config from breakdown
LeonhardFS Feb 8, 2022
a5793c8
only build C extension
LeonhardFS Feb 9, 2022
1693ba9
build fix
LeonhardFS Feb 9, 2022
3a88b37
change docker
LeonhardFS Feb 9, 2022
9772e4d
fixed to use correct llvm9
LeonhardFS Feb 9, 2022
25dc9ae
added manual clean of build folder to avoid issues
LeonhardFS Feb 9, 2022
443a1a2
discovered plotting bug
LeonhardFS Feb 18, 2022
7ee2d32
added missing 311 tuplex st benchmarks
LeonhardFS Feb 18, 2022
c10ddee
typo fix
LeonhardFS Feb 18, 2022
36f076e
give dask more mem for zillow queries so pypy succeeds
LeonhardFS Feb 18, 2022
d58c578
another fix for a dask script
LeonhardFS Feb 19, 2022
8661e72
removed timeout for dask zillow queries
LeonhardFS Feb 20, 2022
04b973b
fix
LeonhardFS Feb 20, 2022
d776afd
key error fix
LeonhardFS Feb 22, 2022
d3fc518
docker fix
LeonhardFS Feb 23, 2022
d8ffdfc
update
LeonhardFS Feb 23, 2022
bda4ec8
Update README.md
LeonhardFS Feb 24, 2022
45f3d12
merge with current master
LeonhardFS Apr 11, 2022
347b8a7
compile fix
LeonhardFS Apr 11, 2022
a0da80e
Update FileOutputTest.cc
LeonhardFS Apr 11, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
100 changes: 0 additions & 100 deletions benchmarks/311/benchmark.sh

This file was deleted.

48 changes: 39 additions & 9 deletions benchmarks/311/runbenchmark.sh
Original file line number Diff line number Diff line change
Expand Up @@ -16,47 +16,77 @@ python3 create_conf.py --opt-null --opt-pushdown --opt-filter --opt-llvm > tuple
python3 create_conf.py --opt-pushdown --opt-filter --opt-llvm > tuplex_config.json
cp tuplex_config.json ${RESDIR}

echo "running tuplex"
# Weld
echo "running weld"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/weld-run-$r.txt"
rm -rf "${OUTPUT_DIR}/weld_output"
timeout $TIMEOUT ${HWLOC} python2 rungrizzly.py --path $DATA_PATH --output-path ${OUTPUT_DIR}/weld_output >$LOG 2>$LOG.stderr
done

echo "-- running tuplex (single-threaded & multi-threaded)"

cp tuplex_config_mt.json tuplex_config.json
echo "running mt tuplex e2e"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/tuplex-run-e2e-$r.txt"
rm -rf "${OUTPUT_DIR}/tuplex_output"
timeout $TIMEOUT ${PYTHON} runtuplex.py --path $DATA_PATH --output-path "${OUTPUT_DIR}/tuplex_output" >$LOG 2>$LOG.stderr
timeout $TIMEOUT ${PYTHON} runtuplex.py --path $DATA_PATH --output-path ${OUTPUT_DIR}/tuplex_e2e >$LOG 2>$LOG.stderr
done

echo "running mt tuplex cached"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/tuplex-run-weld-$r.txt"
rm -rf "${OUTPUT_DIR}/tuplex_output"
timeout $TIMEOUT ${PYTHON} runtuplex.py --path $DATA_PATH --weld-mode --output-path "${OUTPUT_DIR}/tuplex_output" >$LOG 2>$LOG.stderr
timeout $TIMEOUT ${PYTHON} runtuplex.py --path $DATA_PATH --output-path ${OUTPUT_DIR}/tuplex_cached --weld-mode >$LOG 2>$LOG.stderr
done

cp tuplex_config_st.json tuplex_config.json
echo "running st tuplex e2e"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/sttuplex-run-e2e-$r.txt"
timeout $TIMEOUT ${PYTHON} runtuplex.py --path $DATA_PATH --output-path ${OUTPUT_DIR}/sttuplex_e2e >$LOG 2>$LOG.stderr
done

echo "running st tuplex cached"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/sttuplex-run-weld-$r.txt"
timeout $TIMEOUT ${PYTHON} runtuplex.py --path $DATA_PATH --output-path ${OUTPUT_DIR}/sttuplex_cached --weld-mode >$LOG 2>$LOG.stderr
done


# spark
export PYSPARK_PYTHON=${PYTHON}
export PYSPARK_DRIVER_PYTHON=${PYTHON}
echo "benchmarking pyspark"
echo "benchmarking pyspark (4 modes)"
echo "benchmarking pyspark e2e"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/pyspark-run-e2e-$r.txt"
timeout $TIMEOUT spark-submit --master "local[16]" --driver-memory 100g runpyspark.py --path $DATA_PATH --output-path "${OUTPUT_DIR}/spark_output" >$LOG 2>$LOG.stderr
done
echo "benchmarking pyspark cached"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/pyspark-run-weld-$r.txt"
timeout $TIMEOUT spark-submit --master "local[16]" --driver-memory 100g runpyspark.py --path $DATA_PATH --weld-mode --output-path "${OUTPUT_DIR}/spark_output" >$LOG 2>$LOG.stderr
done
echo "benchmarking pyspark sql e2e"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/pysparksql-run-e2e-$r.txt"
timeout $TIMEOUT spark-submit --master "local[16]" --driver-memory 100g runpyspark.py --path $DATA_PATH --sql-mode --output-path "${OUTPUT_DIR}/spark_output" >$LOG 2>$LOG.stderr
done
echo "benchmarking pyspark sql cached"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/pysparksql-run-weld-$r.txt"
timeout $TIMEOUT spark-submit --master "local[16]" --driver-memory 100g runpyspark.py --path $DATA_PATH --sql-mode --weld-mode --output-path "${OUTPUT_DIR}/spark_output" >$LOG 2>$LOG.stderr
done


# Dask
echo "benchmarking dask"
echo "benchmarking dask e2e"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/dask-run-e2e-$r.txt"
timeout $TIMEOUT ${PYTHON} rundask.py --path $DATA_PATH --output-path "${OUTPUT_DIR}/dask_output" >$LOG 2>$LOG.stderr
timeout $TIMEOUT ${PYTHON} rundask.py --path $DATA_PATH --output-path "${OUTPUT_DIR}/dask_e2e" >$LOG 2>$LOG.stderr
done
echo "benchmarking dask cached"
for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${RESDIR}/dask-run-weld-$r.txt"
timeout $TIMEOUT ${PYTHON} rundask.py --path $DATA_PATH --weld-mode --output-path "${OUTPUT_DIR}/dask_output" >$LOG 2>$LOG.stderr
timeout $TIMEOUT ${PYTHON} rundask.py --path $DATA_PATH --weld-mode --output-path "${OUTPUT_DIR}/dask_cached" >$LOG 2>$LOG.stderr
done
11 changes: 11 additions & 0 deletions benchmarks/311/rundask.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
import shutil
import time
import argparse
import json
Expand Down Expand Up @@ -43,6 +44,16 @@ def fix_zip_codes(zips):
# save the run configuration
output_path = args.output_path

# if dir exists, remove
if os.path.exists(output_path):
shutil.rmtree(output_path)
os.makedirs(output_path, exist_ok=True)

# remove all files within

# dask will fail if it is a directory else
output_path = os.path.join(output_path, 'export-*.csv')

# get the input files
perf_paths = [args.data_path]
if not os.path.isfile(args.data_path):
Expand Down
2 changes: 1 addition & 1 deletion benchmarks/311/rungrizzly.py
Original file line number Diff line number Diff line change
Expand Up @@ -47,4 +47,4 @@

print("Total end-to-end time, including compilation: %.2f" % query_time)

print('framework,pandas_load,load,query\n{},{},{},{}'.format('weld-grizzly', pandas_load_time, load_time, query_time))
print('framework,pandas_load,load,query\n{},{},{},{}'.format('weld-grizzly', pandas_load_time, load_time, query_time))
9 changes: 9 additions & 0 deletions benchmarks/flights/runbenchmark.sh
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,10 @@ PYTHON=python3.6
mkdir -p ${LG_RESDIR}
mkdir -p ${SM_RESDIR}

# create original tuplex_config.json (gets overwritten by breakdown...)
cp tuplex_config_template.json tuplex_config.json
cat tuplex_config.json

echo "running on large flight data"
echo "running tuplex"
for ((r = 1; r <= NUM_RUNS; r++)); do
Expand Down Expand Up @@ -65,3 +69,8 @@ for ((r = 1; r <= NUM_RUNS; r++)); do
LOG="${SM_RESDIR}/dask-run-$r.txt"
timeout $TIMEOUT ${PYTHON} rundask.py --path $SM_INPUT_PATH --output-path $OUTPUT_DIR/dask >$LOG 2>$LOG.stderr
done

# copy config files after LG run
cp tuplex_config.json ${LG_RESDIR}/
# copy config files after SM run
cp tuplex_config.json ${SM_RESDIR}/
3 changes: 2 additions & 1 deletion benchmarks/flights/rundask.py
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,8 @@
#client = Client(n_workers=8, threads_per_worker=1, processes=True) # default init
# client = Client(n_workers=16, threads_per_worker=1, processes=True)

client=Client(n_workers=16, threads_per_worker=1, processes=True, memory_limit='8GB')
# because Dask tends to fail for the large flights dataset, give it more memory than Tuplex/Spark
client=Client(n_workers=16, threads_per_worker=1, processes=True, memory_limit='12GB')
#client=Client()
print(client)

Expand Down
11 changes: 11 additions & 0 deletions benchmarks/flights/tuplex_config_template.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{"webui.enable": false,
"executorCount": 15,
"executorMemory": "6G",
"driverMemory": "10G",
"partitionSize": "32MB",
"runTimeMemory": "8MB",
"useLLVMOptimizer": true,
"optimizer.nullValueOptimization": true,
"csv.selectionPushdown": true,
"resolveWithInterpreterOnly":false,
"mergeRowsInOrder":false}
55 changes: 42 additions & 13 deletions benchmarks/logs/process_data.py
100644 → 100755
Original file line number Diff line number Diff line change
@@ -1,19 +1,48 @@
#!/usr/bin/env python3
import glob
import os
import argparse

input_dir = "/disk/data/weblogs/*.*.*.txt"
output_dir = "/disk/data/weblogs_clean"
if __name__ == '__main__':
# parse the arguments
parser = argparse.ArgumentParser(description="Apache data cleaning + join")
parser.add_argument(
"--input-path",
type=str,
dest="input_path",
default="/data/logs",
help="raw logs path",
)
parser.add_argument(
"--output-path",
type=str,
dest="output_path",
default="/data/logs_clean",
help="raw logs path",
)

if not os.path.exists(output_dir):
os.makedirs(output_dir)
args = parser.parse_args()

for filename in glob.glob(input_dir):
new_filename = f'{output_dir}/{filename[filename.rfind("/")+1:]}'
print(f'{filename} -> {new_filename}')
with open(filename, "r", encoding='latin_1') as f:
with open(new_filename, 'w') as fo:
for l in f:
test = l.encode('ascii', 'replace').decode('ascii') # make it ascii
test = test.replace('\0', '?') # get rid of null characters
fo.write(test)
input_dir = os.path.join(args.input_path, "*.*.*.txt")
output_dir = args.output_path

if not os.path.exists(output_dir):
os.makedirs(output_dir)

num_skipped = 0
for filename in glob.glob(input_dir):
new_filename = f'{output_dir}/{filename[filename.rfind("/")+1:]}'

if os.path.isfile(new_filename):
num_skipped += 1
else:
print(f'{filename} -> {new_filename}')
with open(filename, "r", encoding='latin_1') as f:
with open(new_filename, 'w') as fo:
for l in f:
test = l.encode('ascii', 'replace').decode('ascii') # make it ascii
test = test.replace('\0', '?') # get rid of null characters
fo.write(test)
if num_skipped > 0:
print('skipped {} files'.format(num_skipped))
print('Done.')
10 changes: 7 additions & 3 deletions benchmarks/logs/runbenchmark.sh
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,14 @@ if [ $# -eq 1 ]; then # check if hwloc
HWLOC="hwloc-bind --cpubind node:1 --membind node:1 --cpubind node:2 --membind node:2"
fi

DATA_PATH=/data/logs

# invoke preprocess script
python3 process_data.py --input-path /data/logs --output-path /data/logs_clean

DATA_PATH=/data/logs_clean
IP_PATH=/data/logs/ip_blacklist.csv
RESDIR=/results/weblogs
OUTPUT_DIR=/results/output/weblogs
RESDIR=/results/logs
OUTPUT_DIR=/results/output/logs
NUM_RUNS="${NUM_RUNS:-11}"
REDUCED_NUM_RUNS=4
TIMEOUT=14400
Expand Down
3 changes: 2 additions & 1 deletion benchmarks/logs/rundask.py
Original file line number Diff line number Diff line change
Expand Up @@ -244,7 +244,8 @@ def try_int(x):
import pandas as pd
import numpy as np

client = Client(n_workers=16, threads_per_worker=1, processes=True, memory_limit='8GB')
# because Dask tends to fail for the large flights dataset, give it more memory than Tuplex/Spark
client = Client(n_workers=16, threads_per_worker=1, processes=True, memory_limit='14GB')
print(client)
startup_time = time.time() - tstart
print("Dask startup time: {}".format(startup_time))
Expand Down
2 changes: 1 addition & 1 deletion benchmarks/logs/tuplex_config.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"executorMemory": "6G",
"executorCount": 15,
"driverMemory": "10G",
"partitionSize": "32MB",
"partitionSize": "16MB",
"runTimeMemory": "64MB",
"inputSplitSize": "64MB",
"useLLVMOptimizer": true,
Expand Down
9 changes: 5 additions & 4 deletions benchmarks/sigmod21-reproducibility/AWS_Configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,11 +22,12 @@ sudo chown $(whoami) /disk
# 2. install docker
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common p7zip-full
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
apt-cache policy docker-ce # should print out apt details
sudo apt install -y docker-ce
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo systemctl status docker # should print status out
sudo usermod -aG docker ${USER} # allows to run docker commands as non-sudo
#logout, login to make user group changes effective
Expand Down
2 changes: 1 addition & 1 deletion benchmarks/sigmod21-reproducibility/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -100,7 +100,7 @@ In `AWS_Configuration.md` we provide the commands we used to configure the machi

### D) Experimentation Info

For convenience we provide a top-level command-line interface `/tuplex.py` to carry out various tasks in order to reproduce the results. In order to carry out experiments, a couple steps to be performed after setting up a benchmark machine (as described in C) or `AWS_Setup.md` / `AWS_Configuration.md`), for which the CLI may be used:
For convenience, we provide a top-level command-line interface `/tuplex.py` to carry out various tasks in order to reproduce the results. In order to carry out experiments, a couple steps to be performed after setting up a benchmark machine (as described in C) or `AWS_Setup.md` / `AWS_Configuration.md`), for which the CLI may be used:

1. download & extract data
`./tuplex.py download --password <PASSWORD HERE>`
Expand Down
Loading