
Data Science: Project Structure

Why should you use a project structure?

Project structure is a matter of personal preference, but at the end of the day what matters is that you are comfortable moving around the project and writing code.

When we think about data science or data analysis, we often think it is just about the results: charts, figures, insights, or visualizations. While these end products are generally the main event, it's easy to focus on making them look nice and forget about code quality, reusability, and the ability to collaborate with others, and that is where project structure matters.

That being said, there is no single correct way to structure a project, but you should at least have some structure that you follow across all your projects for the sake of standardization. You might be wondering why anyone should bother with a project structure at all. Here are some reasons:

Happy team members

Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run `rails new` to get a standard project skeleton like everybody else.

A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging into extensive documentation. It also means that they don't necessarily have to read 100% of the code before knowing where to look for very specific things.
Well organized code tends to be self-documenting in that the organization itself provides context for your code without much overhead. People will thank you for this because they can:

  • Collaborate more easily with you on this analysis
  • Learn from your analysis about the process and the domain
  • Feel confident in the conclusions at which the analysis arrives

Happy future you

Ever tried to reproduce an analysis that you did a few months ago, or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use `make_figures.py.old`, `make_figures_working.py`, `new_make_figures01.py`, or `try_figre.py` to get things done. You start questioning your existence; that has happened to all of us at some point. Here are some questions we ask when we open an old data science project.

  • Where did I store the data files?
  • Where did the plots get saved?
  • Where did I save the ML model?

These types of questions are painful and are symptoms of a disorganized project. A good project structure encourages practices that make it easier to come back to old work.

Getting Started

As PEP 8 puts it: consistency within a project is more important, and consistency within one module or function is the most important. However, know when to be inconsistent -- sometimes style guide recommendations just aren't applicable. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask!

Requirements

  • Python 3.5+

Starting a new project

Starting a new project is as easy as running a small setup script from the command line:

  • Create a project folder.
  • Copy the setup.py script below into it.
  • Run it with `python setup.py`.
# setup.py
import os

folders = [
    "data",
    "data/processed",
    "data/raw",
    "data/external",
    "data/transformed",
    "docs",
    "models",
    "notebooks",
    "references",
    "reports",
    "reports/figures",
    "src",
    "src/data",
    "src/features",
    "src/models",
    "src/visualization",
]

for folder in folders:
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, ".gitkeep"), "w") as f:
        pass

files = [
    "src/__init__.py",
    "src/data/make_dataset.py",
    "src/features/build_features.py",
    "src/models/train_model.py",
    "src/models/predict_model.py",
    "src/visualization/visualize.py",
    "requirements.txt",
]

for file in files:
    with open(file, "w") as f:
        pass
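Two small design notes on the script above: `os.makedirs(..., exist_ok=True)` makes it safe to re-run over an existing skeleton, and the empty `.gitkeep` files are there only because Git does not track empty directories, so they let the folder layout live in version control before any real files arrive.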

Directory structure

├── data
│   ├── external       <- Data from third party sources.
│   ├── transformed    <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├── setup.py           <- Make this project pip installable with `pip install -e`
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    ├── data           <- Scripts to download or generate data
    │   └── make_dataset.py
    ├── features       <- Scripts to turn raw data into features for modeling
    │   └── build_features.py
    ├── models         <- Scripts to train models and then use trained models to make
    │   │                 predictions
    │   ├── predict_model.py
    │   └── train_model.py
    └── visualization  <- Scripts to create exploratory and results oriented visualizations
        └── visualize.py

Opinions

This is by no means the best possible project structure, but it is a starting point; you can iterate on it to make it your own. Some of the opinions below are about workflows, and some are about tools that make life easier.

Data is immutable

Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis. You shouldn't have to run all of the steps every time you want to make a new figure, but anyone should be able to reproduce the final products with only the code in `src` and the data in `data/raw`. Also, if data is immutable, it doesn't need source control in the same way that code does. Therefore, by default, the data folder is included in the `.gitignore` file. If you have a small amount of data that rarely changes, you may want to include the data in the repository.
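To make the pipeline idea concrete, here is a minimal sketch of what `src/data/make_dataset.py` might look like, assuming a pandas-based workflow; the input file name (`survey.csv`) and the cleaning steps are hypothetical placeholders, and the only point is that `data/raw` is read but never written.

# src/data/make_dataset.py -- a minimal sketch, assuming pandas is installed.
# The raw file name and the cleaning steps are hypothetical placeholders.
from pathlib import Path

import pandas as pd

PROJECT_ROOT = Path(__file__).resolve().parents[2]   # src/data/ -> src/ -> project root
RAW_DIR = PROJECT_ROOT / "data" / "raw"
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"


def make_dataset() -> Path:
    """Read the immutable raw dump and write a cleaned copy to data/processed."""
    raw = pd.read_csv(RAW_DIR / "survey.csv")        # raw data is only ever read
    cleaned = raw.dropna().drop_duplicates()         # hypothetical cleaning steps
    out_path = PROCESSED_DIR / "survey_clean.csv"
    cleaned.to_csv(out_path, index=False)            # derived data lives elsewhere
    return out_path


if __name__ == "__main__":
    print(f"Wrote {make_dataset()}")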

Notebooks are for exploration and experimentation

Notebooks are very effective for exploratory data analysis. However, they can be less effective for reproducing an analysis, so keep the durable logic in `src` and subdivide the `notebooks` folder: for example, `notebooks/exploratory` contains initial explorations, whereas `notebooks/reports` holds more polished work that can be exported as HTML to the `reports` directory.
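As a sketch of how this stays reproducible, the first cell of an exploratory notebook can import project code from `src` instead of copy-pasting it. The snippet below assumes the notebook lives in `notebooks/exploratory` and that the project has not been installed with `pip install -e .` (in which case the `sys.path` tweak is unnecessary); the imported function is the hypothetical one from the earlier sketch.

# First cell of notebooks/exploratory/1.0-xyz-initial-data-exploration.ipynb -- a sketch.
# Kernels usually start in the notebook's own directory, so walk two levels up to the root.
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parents[1]      # notebooks/exploratory -> notebooks -> project root
sys.path.insert(0, str(PROJECT_ROOT))

from src.data.make_dataset import make_dataset   # reuse pipeline code rather than duplicating it

dataset_path = make_dataset()             # regenerate data/processed on demand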

References

  1. https://www.youtube.com/watch?v=MaIfDPuSlw8&t=264s
  2. https://drivendata.github.io/cookiecutter-data-science/
  3. https://learn.microsoft.com/en-us/azure/architecture/data-science-process/overview
  4. https://github.com/Azure/Azure-TDSP-ProjectTemplate

Original Link: https://dev.to/azadkshitij/data-science-project-structure-5dh
