November 22, 2022 08:07 pm GMT

Secure Breast Cancer Identification with Enclaves

Introduction

In this blog post, we develop a logistic regression model for breast cancer identification while keeping the sensitive medical data used for training private, using Cape Privacy's confidential computing platform.

The Issue of Privacy of Medical Data

Performing data analysis and modeling on medical data can provide extremely useful insights into both public and individual health. However, there are two primary challenges when it comes to running statistical analyses or developing predictive models with medical data. The first challenge is the size of medical data sets. Medical trials often include a number of participants that may be too small for creating complex machine learning models. The second challenge is the fact that medical data are governed by various privacy rules and laws such as HIPAA.

One approach to solving this issue is to use differential privacy techniques, which obscure data points related to specific individuals to preserve the privacy of their information. However, the downside is that this allows for studying only aggregated data at a high level. Moreover, the noise added to individual data points to ensure privacy may push the data too far from the original. There is therefore a trade-off between privacy protection and data utility: the amount of noise added is governed by the privacy budget, and more noise gives stronger privacy protection but less useful data. The figure below demonstrates this trade-off with an example of an employee database query that returns the total number of employees in two different months. As more noise is added, the total count of employees becomes less accurate, which may help to hide private information, such as the termination of a specific individual, at the expense of accurately reporting the total headcount.
Figure 1: Differential privacy techniques have a trade-off between privacy protection and data utility (Chang et al., 2021)
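The trade-off can also be seen numerically. The following is a minimal sketch, assuming the Laplace mechanism (one common differential privacy technique, which may not be the one used for the figure) and a hypothetical headcount of 500. The names laplace_noise, noisy_count, and mean_abs_error are illustrative. Here epsilon follows the standard convention in which a smaller value means more noise and stronger privacy.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=2000, seed=0):
    """Average distance between the released count and the true count."""
    rng = random.Random(seed)
    errors = [
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials


if __name__ == "__main__":
    # Hypothetical headcount of 500 employees
    for eps in (0.01, 0.1, 1.0):
        error = mean_abs_error(500, eps)
        print(f"epsilon={eps}: mean absolute error of the count = {error:.1f}")
```

Running the sketch shows the error of the reported headcount growing as epsilon shrinks, which is the loss of utility described above.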

This is where Cape's confidential computing platform, based on AWS Nitro Enclaves, comes in.

How Does Cape Ensure that Confidential Data Remains Private?

Cape's confidential computing platform allows its users to process data in a privacy-preserving manner without having to compromise between data privacy and utility. With Cape, you don't have to use differential privacy methods. Instead, you can process your original data as is, because your data will be encrypted and processed in a secure enclave in the cloud.

Cape provides a CLI that enables its users to encrypt their input data and to deploy and run serverless functions with simple commands: cape encrypt, cape deploy, and cape run. Cape also provides two SDKs, pycape and cape-js, which allow Cape to be used from within Python and JavaScript programs, respectively.

Training a Breast Cancer Identification Model with Cape

In this blog post, we use a publicly available breast cancer dataset, which contains tabular data with several attributes describing the breast tumor (e.g., the size and shape of the tumor) along with a classification of the tumor as malignant or benign. For example, a tumor that is uniform and round is typically noncancerous.

Although most medical data is not public, this dataset is, so we use it as an example to demonstrate how Cape can be leveraged for private medical data processing.

Logistic Regression

Since the model that we wish to develop is a binary classification model that identifies breast tumors as malignant or benign and the number of data points is not very large, a logistic regression model is suitable.

Logistic regression is a classification model that uses input attributes to predict a categorical variable, e.g., yes or no. In this demonstration we focus on binary classification, since there are only two possible outcomes.

Create a Function that Trains a Logistic Regression Model

Any function that is deployed with Cape must be defined in a file named app.py, and app.py must contain a function called cape_handler() that takes the input to be processed and returns the results. In this case the input is the breast cancer dataset that serves as training data, and the output is the trained logistic regression model.
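To make this contract concrete, here is a minimal sketch of an app.py. It is a hypothetical example, not the training function used in this post: it assumes the input arrives as bytes (the training function later in this post decodes it the same way) and, like that function, returns a dictionary. It simply counts the non-empty lines of its input.

```python
# app.py (hypothetical minimal example)


def cape_handler(input_data):
    """Entry point: receives the raw input bytes and returns the result."""
    text = input_data.decode("utf-8")
    # Ignore blank lines so trailing newlines do not affect the count
    lines = [line for line in text.splitlines() if line.strip()]
    return {"num_lines": len(lines)}
```

Because the handler is an ordinary Python function, it can be called locally with a bytes object to check its logic before it is deployed.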

The code snippet below shows our app.py. First, we import some libraries as follows:

import pandas as pd
import numpy as np
import copy

Then we define a logistic regression class with methods that can perform training or compute model accuracy and loss:

class LogisticRegression():
    def __init__(self):
        self.losses = []
        self.train_accuracies = []

    def accuracy_score(self, y_true, y_pred):
        correct = np.sum(y_true == y_pred)
        accuracy = correct/y_true.shape[0]
        return accuracy

    def fit(self, x, y, epochs):
        x = self._transform_x(x)
        y = self._transform_y(y)
        self.weights = np.zeros(x.shape[1])
        self.bias = 0
        for i in range(epochs):
            x_dot_weights = np.matmul(self.weights, x.transpose()) + self.bias
            pred = self._sigmoid(x_dot_weights)
            loss = self.compute_loss(y, pred)
            error_w, error_b = self.compute_gradients(x, y, pred)
            self.update_model_parameters(error_w, error_b)
            pred_to_class = [1 if p > 0.5 else 0 for p in pred]
            self.train_accuracies.append(self.accuracy_score(y, pred_to_class))
            self.losses.append(loss)

    def compute_loss(self, y_true, y_pred):
        # binary cross entropy
        y_zero_loss = y_true * np.log(y_pred + 1e-9)
        y_one_loss = (1-y_true) * np.log(1 - y_pred + 1e-9)
        return -np.mean(y_zero_loss + y_one_loss)

    def compute_gradients(self, x, y_true, y_pred):
        # derivative of binary cross entropy
        difference = y_pred - y_true
        gradient_b = np.mean(difference)
        gradients_w = np.matmul(x.transpose(), difference)
        gradients_w = np.array([np.mean(grad) for grad in gradients_w])
        return gradients_w, gradient_b

    def update_model_parameters(self, error_w, error_b):
        self.weights = self.weights - 0.1 * error_w
        self.bias = self.bias - 0.1 * error_b

    def predict(self, x):
        x_dot_weights = np.matmul(x, self.weights.transpose()) + self.bias
        probabilities = self._sigmoid(x_dot_weights)
        return [1 if p > 0.5 else 0 for p in probabilities]

    def _sigmoid(self, x):
        return np.array([self._sigmoid_function(value) for value in x])

    def _sigmoid_function(self, x):
        if x >= 0:
            z = np.exp(-x)
            return 1 / (1 + z)
        else:
            z = np.exp(x)
            return z / (1 + z)

    def _transform_x(self, x):
        x = copy.deepcopy(x)
        return x.values

    def _transform_y(self, y):
        y = copy.deepcopy(y)
        return y.values.reshape(y.shape[0], 1)
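For reference, the gradient of the mean binary cross-entropy with respect to each weight is the average of (p - y) * x over the samples, and the gradient with respect to the bias is the average of (p - y), where p is the predicted probability and y the label. The sketch below writes this out with plain Python lists so that each term is explicit. It is an independent illustration under those assumptions, not a replacement for the class above, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(weights, bias, rows, labels):
    """Mean binary cross-entropy over all samples."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total += y * math.log(p + 1e-9) + (1 - y) * math.log(1 - p + 1e-9)
    return -total / len(rows)


def bce_gradients(weights, bias, rows, labels):
    """Analytic gradients of the mean binary cross-entropy."""
    n = len(rows)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        diff = p - y  # each prediction is paired with its own label
        for j, xi in enumerate(x):
            grad_w[j] += diff * xi / n
        grad_b += diff / n
    return grad_w, grad_b
```

Note that each prediction is paired with its own label here. In the NumPy class above, _transform_y returns labels of shape (n, 1) while the predictions have shape (n,), so y_pred - y_true broadcasts to an (n, n) array. Readers adapting the class may want to check that these shapes match.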

In addition to the logistic regression class, our app.py also contains the required cape_handler function, which takes the training data as input, splits it into a train and test set, instantiates the above defined logistic regression class, performs training, and outputs the trained model along with its accuracy.

def cape_handler(input_data):
    # Decode the input bytes and convert escaped tabs and newlines into CSV separators
    csv = input_data.decode("utf-8")
    csv = csv.replace('\\t', ',').replace('\\n', '\n')
    with open('data.csv', 'w') as f:
        f.write(csv)

    data = pd.read_csv('data.csv')
    data_size = data.shape[0]

    # Hold out a random 33% of the rows for testing
    test_split = 0.33
    test_size = int(data_size * test_split)
    choices = np.arange(0, data_size)
    test = np.random.choice(choices, test_size, replace=False)
    train = np.delete(choices, test)
    test_set = data.iloc[test]
    train_set = data.iloc[train]

    # Use every column except the first and the last as a feature
    column_names = list(data.columns.values)
    features = column_names[1:len(column_names)-1]
    y_train = train_set["target"]
    y_test = test_set["target"]
    X_train = train_set[features]
    X_test = test_set[features]

    lr = LogisticRegression()
    lr.fit(X_train, y_train, epochs=150)
    pred = lr.predict(X_test)
    accuracy = lr.accuracy_score(y_test, pred)

    # trained model
    model = {"accuracy": accuracy, "weights": lr.weights.tolist(), "bias": lr.bias.tolist()}
    return model
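Because the handler draws the test rows at random, the split, and therefore the reported accuracy, can change from run to run. Below is a minimal sketch of the same kind of split with an optional fixed seed, using only the standard library. The function split_indices and its seed parameter are hypothetical and are not part of the app.py above.

```python
import random


def split_indices(n_rows, test_fraction=0.33, seed=None):
    """Return (train, test) lists of row indices that do not overlap."""
    rng = random.Random(seed)
    test_size = int(n_rows * test_fraction)
    # Sample test rows without replacement
    test = sorted(rng.sample(range(n_rows), test_size))
    test_lookup = set(test)
    # Every remaining row goes to the training set
    train = [i for i in range(n_rows) if i not in test_lookup]
    return train, test


if __name__ == "__main__":
    train, test = split_indices(100, seed=42)
    print(len(train), len(test))
```

The resulting index lists could be passed to data.iloc in the same way as the NumPy arrays in the handler.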

Deploy with Cape

To deploy our function with Cape, we first need to create a folder that contains the function and all of its dependencies. For this logistic regression training app, the deployment folder needs to contain the app.py above. Additionally, because app.py imports external libraries (in this case, numpy and pandas), the deployment folder needs to contain those as well. We can list those dependencies in a requirements.txt file and run Docker to install them into our deployment folder, called app, as follows:

sudo docker run -v `pwd`:/build -w /build --rm -it python:3.9-slim-bullseye pip install -r requirements.txt --target ./app/

Now that we have everything ready, we can log into Cape:

cape login
Your CLI confirmation code is: GZPN-KHMT
Visit this URL to complete the login process: https://login.capeprivacy.com/activate?user_code=GZPN-KHMT
Congratulations, you're all set!

And after that we can deploy the app:

cape deploy ./app
Deploying function to Cape ...
Success! Deployed function to Cape
Function Checksum  348ea2008f014b4d62562b4256bf2ddbbebcbd8b958981de5c2e01a973f690f8
Function Id  5wggR4ZaEBdfHQSbV2RcN5

Invoke with Cape

Now that the app is deployed, we can pass it an input and invoke it with cape run:

cape run 5wggR4ZaEBdfHQSbV2RcN5 -f breast_cancer_data.csv
{'accuracy': 0.9197860962566845, 'weights': [10256.691270418847, 19071.613672774896, 63157.95554188486, 97842.31573298419, 106.154850842932, 43.29810217015701, -44.1862890971466, -22.519840356544492, 198.12010662303672, 78.6238754895288, 48.39822623036688, 1508.6634081937177, 342.695612801048, -22814.6600120419, 8.905474463874354, 16.958969184554977, 18.625567417774857, 7.857666827748692, 25.00139435235602, 4.305377619109947, 9667.094831413606, 24077.953801047104, 59698.82218324606, -91019.69570680606, 137.85512994764406, 64.23315269371734, -35.801829085602265, 1.0606119748691598, 287.2889897905756, 89.52499975392664], 'bias': 3.247905759162303}

The output above lists the parameters of the trained model, i.e., its weights and bias, which define the model and can be used to perform inference. We can also see that the accuracy of the trained model on the test data is approximately 92%.
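As a minimal sketch of how such parameters could be used for inference, the functions below apply the logistic function to the weighted sum of the features and threshold the result at 0.5, mirroring the predict method of the class above. The function names and the toy two-feature model are hypothetical. A real call would pass the dictionary returned by cape run and a feature vector with one value per weight.

```python
import math


def predict_proba(model, features):
    """Probability of class 1 from a {'weights': [...], 'bias': ...} dictionary."""
    z = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(model, features, threshold=0.5):
    """Class label (1 or 0, as encoded in the target column)."""
    return 1 if predict_proba(model, features) > threshold else 0


if __name__ == "__main__":
    # Toy model with two features, for illustration only
    toy_model = {"weights": [2.0, -1.0], "bias": 0.5}
    print(predict(toy_model, [1.0, 0.5]))
    print(predict(toy_model, [-1.0, 1.0]))
```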

Conclusion

In this blog post, we discussed the challenges of developing predictive models on medical data and how Cape's confidential computing platform can alleviate the privacy issues associated with medical data processing. We defined a logistic regression model and trained it to identify breast tumors as malignant or benign while keeping the medical data used for training confidential.


Original Link: https://dev.to/ekloberdanz/secure-breast-cancer-identification-with-enclaves-4ic1
