Your Web News in One Place

Help Webnuz

Referal links:

Sign up for GreenGeeks web hosting
March 15, 2021 01:01 pm GMT

Work with Production Data Deletion

Introduction

The working process of production data deletion is completely different from development tasks.
We should take it as another kind of task, and think with another way. The coding examples are based on RoR in this article.

Estimate the workload

It is difficult to estimate without investigation to work through data deletion.
I can say, do not estimate via the usual development style estimation process.
As a rough estimation, it could take 2 weeks at least to complete the whole process.
Because we should answer the questions listed below in advance.
Furthermore, performance could be a problem after you finish up writing code.
I do recommend doing a sensitive investigation, discuss it with all involved people including QA developers before start working on it.
It would be the fastest way.

Questions we should answer before start coding

  • How much data do we have to delete?
  • How can we test the deletion?
  • What is the data to delete?
  • How many related resources belong to target data?
  • How can we estimate the deletion time?
  • Can we have an alternative approach if it's difficult to delete the whole data?

General coding tips to handle huge data to delete

I show the example via Ruby on Rails(RoR) code but it would be able to translate to other languages easily.
Let's say, you are working with an E-commerce app and we have orders, and order_items belong to each order.

Avoid N+1

Do not iterate all associated relations

If the target data amount is huge, using iteration could be a cause of a terrible performance problem.

Order.where(user_id: 10).all.each do |order|  order.order_items.each do |order_item|     ...  endend

In this example, if the order amount is millions, this leads significant N+1 problem.
We should consider the way not to use ORM magics.

Use delete_all instead of destroy_all

In RoR, there are basically 2 methods to delete records, delete and destroy. destroy methods calls callbacks defined on a model before deleting the object. On the other hand, delete doesn't see callbacks set up on the model, simply calling a delete SQL.
Given that, delete_all is way faster than destroy_all, and it makes a significant performance difference if the data is big.
And, it means we should take care of the operation we defined in the before_destroy or after_destroy callbacks manually.

Use batches

If it comes to delete even thousands of records, we should handle it carefully to use database resources sufficiently. Probably simply calling delete_all is not enough.
In that case, use something like built-in method find_in_batches or delete_in_batches gem.
The batch size should be around some thousands to prevent DB stuck. In find_in_batches, default is 1,000. In delete_in_batches, default is 10,000.

Order.find_in_batches(batch_size: 5000).each do |orders|  orders.delete_all  sleep(1) # throttle down DBend

Set asleep to throttle down the DB

Like the above example, looks good to throttle down by using sleep in batches when we delete in batches. The RoR official website recommends it.

When the deletion is quite slow

Possibly you couldn't find this problem until using the full prod pruned data. That is why using prod pruned data is important.
Anyhow, in many cases, the cause of the slow query is full table scan due to no index as you know.

Case 1: No index on the column used in where SQL

DELETE users where email = 'something@localhost';

SQL tries to look for target records by email but there is no index on email so that it does a full table scan. It is simple and you are probably familiar with it.

Case 2: No index on the column to use constraint check

DELETE addresses;
                  Table "addresses"   Column  | Type |     Modifiers-----------+------+------------------------------ id        | int  | not null default nextval(... street_no | int  | not null...Indexes:    "idx_primary" PRIMARY KEY, btree (id)Referenced by:    TABLE "users" CONSTRAINT "fk_users_addresses" FOREIGN KEY (address_id) REFERENCES addresses(id) NOT VALID                  Table "users"    Column   |  Type  |         Modifiers-------------+--------+-------------------------------- id          | bigint | not null default nextval(... address_id  | int    |...Indexes:    "idx_primary" PRIMARY KEY, btree (id)Foreign-key constraints:    "fk_users_addresses" FOREIGN KEY (address_id) REFERENCES addresses(id) NOT VALID

SQL tries to delete addresses but addresses referenced by users table, and it has a foreign key constraint on address_id. But because address_id does not have index, SQL does a full table scan for users when deleting addresses

When we need further consideration

If the target data has a lot of associations, we should consider deleting some of them tentatively.
For example, an order has some order_items, purchaced_users, addresses, items, etc...
But in this case, we will create orphan data which doesn't have the parent data(i, e. relations of orders and order_items, deleting only orders)
This would be tech debt, so when we take this solution, we should have common sense with the people involved.

Create to_be_deleted tables

To delete only required data and delete orphan data in the future, consider creating "to_be_deleted" tables which have ids of target data to be deleted. Those tables look like below.

-- orders_to_be_deletedTable "orders_to_be_deleted"  Column  |  Type   | Modifiers----------+---------+----------- order_id | integer | not nullIndexes:    "index_orders_to_be_deleted_on_order_id" btree (order_id)

By using these id tables, we can refer to deleted ids after deleting orders themselves for example.

Conclusion

Those are what we should consider when deleting data.
I'm glad if someone won't do the same mistakes as mine when working on data deletion projects.


Original Link: https://dev.to/kazu9su/work-with-production-data-deletion-3p6j

Share this article:    Share on Facebook
View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To