janitbidhan/Ad-Fraud-Detection-PySpark

Team:

  • Janit Bidhan
  • Sreenivasa Rayaprolu

Contents of README.md

  • Folder Structure
  • Instructions to Run the files

Folder Structure

    Final Project/
        code/
            Jupyter Notebooks/
                Ad_Fraud_with_CatBoost.ipynb
                Ad_Fraud_with_LightGBM.ipynb
                Ad_Fraud_with_RF-LR-SCV.ipynb
                csv_to_parquet.ipynb
            Python Code/
                Ad_Fraud_with_CatBoost.py
                Ad_Fraud_with_LightGBM.py
                Ad_Fraud_with_RF-LR-SCV.py
                csv_to_parquet.py
        ScreenShots/
            1.VPC-Creation.png
            2.EMR-Cluster-creation.png
            3.EMR-Cluster-configuration.png
            4.Persues-Command.png
            5.Persues-cluster-output.png
            6.Databricks-Loading-JarFiles.png
            7.Databricks-Cluster-configuration
            8.Databricks-WorkSpace.png
            9.Databricks-python-notebook.png
        Presentation.pdf
        README.md
        REPORT.pdf
        video_presentation_link.txt

Downloading Dataset:

Code Files Description:

  • csv_to_parquet.ipynb and csv_to_parquet.py: Convert the .csv dataset to a .parquet file and save it in the desired location.

  • Ad_Fraud_with_RF-LR-LSVC.ipynb and Ad_Fraud_with_RF-LR-LSVC.py: Implement Logistic Regression, Random Forest, and LinearSVC classification models with different sampling techniques.

  • Ad_Fraud_with_CatBoost.ipynb and Ad_Fraud_with_CatBoost.py: Implement a CatBoost classification model with different sampling techniques.

  • Ad_Fraud_with_LightGBM.ipynb and Ad_Fraud_with_LightGBM.py: Implement a LightGBM classification model with different sampling techniques.
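The conversion step performed by csv_to_parquet.py can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the header and inferSchema options, and the output naming rule are all assumptions made for the example.

```python
import os


def parquet_path_for(csv_path):
    """Return the input path with its extension replaced by .parquet."""
    # os.path.splitext leaves URI prefixes such as s3:// untouched.
    return os.path.splitext(csv_path)[0] + ".parquet"


def csv_to_parquet(csv_path, parquet_path=None):
    """Read a CSV file with Spark and write it back out as Parquet."""
    # Imported here so the path helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()
    out = parquet_path or parquet_path_for(csv_path)
    # Assumption: the CSV has a header row and column types can be inferred.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # Overwrite any earlier output so the script can be re-run safely.
    df.write.mode("overwrite").parquet(out)
    return out
```

Calling csv_to_parquet("data/train.csv") would, under these assumptions, write the Parquet output to data/train.parquet; the path here is a placeholder.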

Order of Running the code:

  • 1st: Run csv_to_parquet.py with the command spark-submit csv_to_parquet.py
  • 2nd: Run Ad_Fraud_with_RF-LR-LSVC.py with the command spark-submit Ad_Fraud_with_RF-LR-LSVC.py
  • 3rd: Run Ad_Fraud_with_CatBoost.py with the command spark-submit Ad_Fraud_with_CatBoost.py
  • 4th: Run Ad_Fraud_with_LightGBM.py with the command spark-submit Ad_Fraud_with_LightGBM.py
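If you prefer to launch the four steps with one command, the order above can be scripted. The sketch below is a hypothetical helper and is not part of the repository. It assumes spark-submit is on the PATH and that it is run from the folder containing the scripts. The folder listing names the second script Ad_Fraud_with_RF-LR-SCV.py, so adjust the list to match the files you actually have.

```python
import subprocess

# Script names follow the run order listed above; edit them if your
# checkout uses different file names.
SCRIPTS = [
    "csv_to_parquet.py",
    "Ad_Fraud_with_RF-LR-LSVC.py",
    "Ad_Fraud_with_CatBoost.py",
    "Ad_Fraud_with_LightGBM.py",
]


def build_commands(scripts=SCRIPTS):
    """Return one spark-submit argument list per script, in run order."""
    return [["spark-submit", script] for script in scripts]


def run_all(scripts=SCRIPTS):
    """Run each script in order and stop at the first failure."""
    for command in build_commands(scripts):
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against missing Parquet output.
        subprocess.run(command, check=True)
```

Calling run_all() starts the conversion first, so the model scripts only run once the Parquet file exists.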

Creating a Cluster to Run the Code

  • The submitted Python files can be run on any Spark cluster.
  • We tested our Python files on the Persues cluster, an Amazon Elastic MapReduce (EMR) cluster, and a Databricks cluster.

Instructions to Create a Cluster on AWS Elastic MapReduce and Run the Code:

  • Create an AWS account.
  • Create a new AWS VPC for this project.
  • Configure a cluster with Spark 3.0.2 installed on Hadoop 2.7.
  • Use mx2.large machines to create the cluster. By default this creates 1 master node (8 GB memory) and 2 worker nodes (8 GB memory each).
  • Refer to the ScreenShots folder for the detailed EMR cluster configuration.
  • A cluster with the above architecture is recommended for running the Python files.
  • Generate SSH keys for the cluster and save them in a secure folder.
  • Update the inbound security rules of the cluster to include your IP address.
  • Create a new S3 storage bucket and upload the dataset.
  • Use the SSH keys and the public IP address of the cluster to SSH into it, with PuTTY or a terminal depending on your operating system.
  • Use any file transfer application to move the Python files onto the EMR cluster.
  • You can also use an AWS Cloud9 environment to SSH into the cluster and run the code.
  • Use spark-submit file_name.py to run the PySpark code.
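For terminal users, the copy-and-run steps above can be expressed as plain scp and ssh commands. The sketch below only builds those commands. It is an illustration under assumptions: the key file name and host name are placeholders, and hadoop is used as the login user because it is the usual default on EMR, which may differ for your cluster.

```python
import os


def emr_commands(key_file, master_host, script):
    """Build the scp and ssh argument lists for one PySpark script."""
    # Assumption: 'hadoop' is the login user on the EMR master node.
    remote = f"hadoop@{master_host}"
    # Copy the script into the remote home directory.
    copy = ["scp", "-i", key_file, script, f"{remote}:~/"]
    # Run it there by file name, since scp drops any local directory part.
    run = ["ssh", "-i", key_file, remote, "spark-submit", os.path.basename(script)]
    return copy, run


# Placeholder values; replace them with your own key and public address.
copy_cmd, run_cmd = emr_commands("my-key.pem", "ec2-host.example.com", "code/csv_to_parquet.py")
print(" ".join(copy_cmd))
print(" ".join(run_cmd))
```

The printed commands can be pasted into a terminal, or each list can be passed to subprocess.run(..., check=True) to execute it directly.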

Instructions to Create a Cluster on Databricks and Run the Code:

  • Create a Databricks account.
  • In the Compute section, create a cluster with the configuration mentioned above.
  • Refer to the ScreenShots folder for the detailed cluster configuration.
  • In the Workspace section, you can upload the Jupyter notebooks or Python files.
  • Go to the required file and attach the created Spark cluster to the workspace environment.
  • To run the code, use spark-submit file_name.py in the terminal.

Results in Jupyter Notebooks

  • The submitted Jupyter notebooks have their results saved in them.
  • The Jupyter notebooks can be used for a quick review of the results without re-running the code.
