PySpark startup time: 1.8000900745391846 s
PySpark job time: 4817.788696050644 s
== Parsed Logical Plan ==
Join Inner, (ip#6 = BadIPs#35)
:- Project [ip#6, client_id#7, user_id#8, date#9, method#10, RandomizeEndpointUDF(endpoint#11) AS endpoint#25, protocol#12, response_code#13, content_size#14]
:  +- Project [col1#3.ip AS ip#6, col1#3.client_id AS client_id#7, col1#3.user_id AS user_id#8, col1#3.date AS date#9, col1#3.method AS method#10, col1#3.endpoint AS endpoint#11, col1#3.protocol AS protocol#12, col1#3.response_code AS response_code#13, col1#3.content_size AS content_size#14]
:     +- Project [col0#0, ParseWithRegex(col0#0) AS col1#3]
:        +- Relation[col0#0] csv
+- Relation[BadIPs#35] csv

== Analyzed Logical Plan ==
ip: string, client_id: string, user_id: string, date: string, method: string, endpoint: string, protocol: string, response_code: int, content_size: int, BadIPs: string
Join Inner, (ip#6 = BadIPs#35)
:- Project [ip#6, client_id#7, user_id#8, date#9, method#10, RandomizeEndpointUDF(endpoint#11) AS endpoint#25, protocol#12, response_code#13, content_size#14]
:  +- Project [col1#3.ip AS ip#6, col1#3.client_id AS client_id#7, col1#3.user_id AS user_id#8, col1#3.date AS date#9, col1#3.method AS method#10, col1#3.endpoint AS endpoint#11, col1#3.protocol AS protocol#12, col1#3.response_code AS response_code#13, col1#3.content_size AS content_size#14]
:     +- Project [col0#0, ParseWithRegex(col0#0) AS col1#3]
:        +- Relation[col0#0] csv
+- Relation[BadIPs#35] csv

== Optimized Logical Plan ==
Join Inner, (ip#6 = BadIPs#35)
:- Project [pythonUDF8#101.ip AS ip#6, pythonUDF8#101.client_id AS client_id#7, pythonUDF8#101.user_id AS user_id#8, pythonUDF8#101.date AS date#9, pythonUDF8#101.method AS method#10, pythonUDF0#102 AS endpoint#25, pythonUDF8#101.protocol AS protocol#12, pythonUDF8#101.response_code AS response_code#13, pythonUDF8#101.content_size AS content_size#14]
:  +- BatchEvalPython [RandomizeEndpointUDF(pythonUDF8#101.endpoint)], [pythonUDF8#101, pythonUDF0#102]
:     +- Project [pythonUDF8#101]
:        +- BatchEvalPython [ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0)], [col0#0, pythonUDF0#93, pythonUDF1#94, pythonUDF2#95, pythonUDF3#96, pythonUDF4#97, pythonUDF5#98, pythonUDF6#99, pythonUDF7#100, pythonUDF8#101]
:           +- Project [col0#0]
:              +- Filter isnotnull(pythonUDF0#92.ip)
:                 +- BatchEvalPython [ParseWithRegex(col0#0)], [col0#0, pythonUDF0#92]
:                    +- Relation[col0#0] csv
+- Filter isnotnull(BadIPs#35)
   +- Relation[BadIPs#35] csv
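Note the `BatchEvalPython` node above: the optimizer expands `ParseWithRegex(col0#0)` into nine separate invocations, one per struct field projected out of the UDF result (plus a tenth, `pythonUDF0#92`, for the `isnotnull(ip)` filter). A minimal pure-Python illustration of why this matters — the regex and sample line below are stand-ins, not the actual `ParseWithRegex` implementation, but the nine-field schema matches the analyzed plan:

```python
import re

CALLS = {"n": 0}  # counts how many times the "UDF" runs

# Hypothetical Apache-style access-log pattern with 9 capture groups,
# matching the 9 fields in the analyzed plan's schema.
LOG_RE = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\d+)$')
FIELDS = ["ip", "client_id", "user_id", "date", "method",
          "endpoint", "protocol", "response_code", "content_size"]

def parse_with_regex(line):
    # Stand-in for the ParseWithRegex UDF: one regex match -> dict of 9 fields.
    CALLS["n"] += 1
    m = LOG_RE.match(line)
    return dict(zip(FIELDS, m.groups())) if m else None

line = '10.0.0.1 - frank [01/Jan/2000:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Per-field evaluation, as in the optimized plan: 9 separate UDF calls.
per_field = {f: parse_with_regex(line)[f] for f in FIELDS}
calls_per_field = CALLS["n"]

# Single evaluation with the struct reused: 1 UDF call, same result.
CALLS["n"] = 0
parsed = parse_with_regex(line)
single = {f: parsed[f] for f in FIELDS}
calls_single = CALLS["n"]

print(calls_per_field, calls_single)  # 9 1
```

With a regex-heavy parser this nine-fold re-evaluation is a plausible contributor to the long job time recorded above.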

== Physical Plan ==
*(5) BroadcastHashJoin [ip#6], [BadIPs#35], Inner, BuildRight
:- *(5) Project [pythonUDF8#101.ip AS ip#6, pythonUDF8#101.client_id AS client_id#7, pythonUDF8#101.user_id AS user_id#8, pythonUDF8#101.date AS date#9, pythonUDF8#101.method AS method#10, pythonUDF0#102 AS endpoint#25, pythonUDF8#101.protocol AS protocol#12, pythonUDF8#101.response_code AS response_code#13, pythonUDF8#101.content_size AS content_size#14]
:  +- BatchEvalPython [RandomizeEndpointUDF(pythonUDF8#101.endpoint)], [pythonUDF8#101, pythonUDF0#102]
:     +- *(3) Project [pythonUDF8#101]
:        +- BatchEvalPython [ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0), ParseWithRegex(col0#0)], [col0#0, pythonUDF0#93, pythonUDF1#94, pythonUDF2#95, pythonUDF3#96, pythonUDF4#97, pythonUDF5#98, pythonUDF6#99, pythonUDF7#100, pythonUDF8#101]
:           +- *(2) Project [col0#0]
:              +- *(2) Filter isnotnull(pythonUDF0#92.ip)
:                 +- BatchEvalPython [ParseWithRegex(col0#0)], [col0#0, pythonUDF0#92]
:                    +- *(1) FileScan csv [col0#0] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/data/logs_clean/2000.01.01.txt, file:/data/logs_clean/2000.01.02.txt, fil..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col0:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
   +- *(4) Project [BadIPs#35]
      +- *(4) Filter isnotnull(BadIPs#35)
         +- *(4) FileScan csv [BadIPs#35] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/data/logs/ip_blacklist.csv], PartitionFilters: [], PushedFilters: [IsNotNull(BadIPs)], ReadSchema: struct<BadIPs:string>
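The physical plan broadcasts the small blacklist side (`BroadcastExchange HashedRelationBroadcastMode`) and probes it row-by-row in the `BroadcastHashJoin` on `ip#6 = BadIPs#35`. In plain Python terms, the join amounts to hashing the small side into a set and filtering the big side against it; the IPs below are made up for illustration:

```python
# Small side: the blacklist, hashed once and shipped to every executor
# (hypothetical contents of ip_blacklist.csv).
bad_ips = {"10.0.0.1", "192.168.1.50"}

# Big side: parsed log records streaming through the join (illustrative rows).
records = [
    {"ip": "10.0.0.1",     "endpoint": "/index.html"},
    {"ip": "172.16.0.9",   "endpoint": "/admin"},
    {"ip": "192.168.1.50", "endpoint": "/login"},
]

# Inner join on ip == BadIPs: keep only records whose ip is in the broadcast set.
flagged = [r for r in records if r["ip"] in bad_ips]
print([r["ip"] for r in flagged])  # ['10.0.0.1', '192.168.1.50']
```

Because only the blacklist is broadcast, the large parsed-log side never shuffles, which is why the join itself is unlikely to dominate the runtime relative to the Python UDF stages.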
{"startupTime": 1.8000900745391846, "jobTime": 4817.788696050644}
