April 20, 2022 03:08 pm GMT

Understanding/Exploring dataset

Before getting started with the actual coding, let's set up our environment.

As mentioned in the previous blog, after downloading our dataset we will extract the downloaded zip file GeneratedLabelledFlows.zip. Once we have all the files, we upload them to Google Drive, in my case under /content/gdrive/My Drive/project/dataset/original. Once that's done, we create a new notebook and connect it to Google Drive.

from google.colab import drive
drive.mount('/content/gdrive')

This will prompt you to grant Colab permission to access your Google Drive.

Hurray! We are now successfully connected to Google Drive, which means we can create, edit, or delete files from Google Colab just as we would on our PC with Jupyter Notebook.
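To double-check that the mount worked, you can list the dataset folder; a minimal sketch, assuming the same Drive path as above:

import os

# Path assumes the folder layout described earlier
dataset_dir = '/content/gdrive/My Drive/project/dataset/original'
print(os.listdir(dataset_dir))  # should print the extracted CSV file names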

Getting an idea of the data

We will use pandas to create a DataFrame in order to get an idea of what our data looks like. You can choose any file; I'm going with Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv.

import pandas as pd

dataset_path = '/content/gdrive/My Drive/project/dataset/'
df = pd.read_csv(dataset_path + 'original/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv')
df.head()

head() returns the first five rows of the DataFrame by default.

[Output of df.head() showing the first five rows of the DataFrame]

We can see that we have a total of 85 fields/columns in our dataset.
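A quick way to confirm that count and get a feel for the data is to inspect the frame's shape, header row, and dtypes; a small sketch:

print(df.shape)                  # (rows, columns); the second number should be 85
print(df.columns.tolist()[:5])   # peek at the first few column names
print(df.dtypes.value_counts())  # how many numeric vs. object columns we have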

Combining all dataset files into one pandas DataFrame

In order to merge all files into one DataFrame, we need to make sure all the files have the same columns. Remember: the same columns, not just the same number of columns. The toy example below shows why this matters.
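Here is a small, purely hypothetical example (these frames are not part of the dataset) of what happens when two frames share a column count but not the column names:

import pandas as pd

a = pd.DataFrame({'x': [1], 'y': [2]})
b = pd.DataFrame({'x': [3], 'z': [4]})  # same number of columns, different names

# pandas aligns on column names: 'y' and 'z' don't match, so NaNs appear
print(pd.concat([a, b]))
#    x    y    z
# 0  1  2.0  NaN
# 0  3  NaN  4.0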

We will create a list of DataFrames. Each entry in the list corresponds to the DataFrame for the respective CSV file in the dataset.

import os

all_files = [dataset_path + 'original/' + each_file for each_file in os.listdir(dataset_path + 'original/')]
all_dfs = [pd.read_csv(each_file, encoding='cp1252') for each_file in all_files]

Now that we have a list of DataFrames, we will check whether they all have the same columns or not.

import numpy as np

total_columns = all_dfs[0].columns
# Record the check for every file, not just the last one,
# so that np.all() in the next step looks at all of them
all_same_column = np.array([df.columns.equals(total_columns) for df in all_dfs])
for (index, same) in enumerate(all_same_column):
    if not same:
        print(f"This {all_files[index]} doesn't have the same columns")
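As a side note, an equivalent compact check (not from the original post, just an alternative) collapses every header row into a set and sees whether exactly one distinct header survives:

# True if and only if every file has the same columns in the same order
all_columns_match = len({tuple(df.columns) for df in all_dfs}) == 1
print(all_columns_match)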

If all the files have the same columns, which will be the case here, we can proceed. Otherwise, we would need to perform some further processing first (one possibility is sketched below) before merging all the DataFrames into one single DataFrame.
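One common form of such processing, assuming the mismatch is nothing more than stray whitespace in the headers (a quirk some CSV exports of this dataset reportedly have), is to normalize the column names and re-run the check above:

# Hypothetical fix: strip leading/trailing spaces from every header,
# then repeat the column comparison from the previous step
for each_df in all_dfs:
    each_df.columns = each_df.columns.str.strip()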

if np.all(all_same_column):
    print("All files have same columns")
    # keep=False drops every copy of a duplicated row, not just the extras
    merge_df = pd.concat(all_dfs).drop_duplicates(keep=False)
    merge_df.reset_index(drop=True, inplace=True)
    print("Total Data Shape: " + str(merge_df.shape))
else:
    print("All files do not have the same columns")
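Since the merge takes a while, it can be worth persisting the combined frame back to Drive so later notebooks can start from it; a minimal sketch, where the output filename is my own choice rather than from the original post:

# Save the merged data for reuse in later steps (filename is an assumption)
merge_df.to_csv(dataset_path + 'combined.csv', index=False)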

Original Link: https://dev.to/daud99/understandingexploring-dataset-4fnm
