The previous article showed how to tackle the Alibaba Cloud Security malware detection competition with classical machine learning. This article looks at how to solve the same task with deep learning.

TextCNN modeling

Data loading

    # Module imports
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    from tqdm import tqdm_notebook
    from sklearn.preprocessing import LabelBinarizer, LabelEncoder
    import warnings
    warnings.filterwarnings('ignore')
    %matplotlib inline

    # Read the local data files
    path = '../security_data/'
    train = pd.read_csv(path + 'security_train.csv')
    test = pd.read_csv(path + 'security_test.csv')

    # Module imports
    import numpy as np
    import pandas as pd
    from tqdm import tqdm

    # Memory-footprint statistics: downcast each numeric column to the
    # smallest dtype that can still hold its value range
    class DataPreprocess:
        def __init__(self):
            self.int8_max = np.iinfo(np.int8).max
            self.int8_min = np.iinfo(np.int8).min
            self.int16_max = np.iinfo(np.int16).max
            self.int16_min = np.iinfo(np.int16).min
            self.int32_max = np.iinfo(np.int32).max
            self.int32_min = np.iinfo(np.int32).min
            self.int64_max = np.iinfo(np.int64).max
            self.int64_min = np.iinfo(np.int64).min
            self.float16_max = np.finfo(np.float16).max
            self.float16_min = np.finfo(np.float16).min
            self.float32_max = np.finfo(np.float32).max
            self.float32_min = np.finfo(np.float32).min
            self.float64_max = np.finfo(np.float64).max
            self.float64_min = np.finfo(np.float64).min

        def get_type(self, min_val, max_val, types):
            # Return the narrowest dtype whose range covers [min_val, max_val],
            # or None if no narrower dtype applies
            if types == 'int':
                if max_val <= self.int8_max and min_val >= self.int8_min:
                    return np.int8
                elif max_val <= self.int16_max and min_val >= self.int16_min:
                    return np.int16
                elif max_val <= self.int32_max and min_val >= self.int32_min:
                    return np.int32
                return None
            elif types == 'float':
                if max_val <= self.float16_max and min_val >= self.float16_min:
                    return np.float16
                if max_val <= self.float32_max and min_val >= self.float32_min:
                    return np.float32
                if max_val <= self.float64_max and min_val >= self.float64_min:
                    return np.float64
                return None

        def memory_process(self, df):
            init_memory = df.memory_usage().sum() / 1024 ** 2 / 1024
            print('Original data occupies {} GB memory.'.format(init_memory))
            df_cols = df.columns
            for col in tqdm_notebook(df_cols):
                try:
                    if 'float' in str(df[col].dtypes):
                        max_val = df[col].max()
                        min_val = df[col].min()
                        trans_types = self.get_type(min_val, max_val, 'float')
                        if trans_types is not None:
                            df[col] = df[col].astype(trans_types)
                    elif 'int' in str(df[col].dtypes):
                        max_val = df[col].max()
                        min_val = df[col].min()
                        trans_types = self.get_type(min_val, max_val, 'int')
                        if trans_types is not None:
                            df[col] = df[col].astype(trans_types)
                except Exception:
                    print('Cannot do any process for column, {}.'.format(col))
            afterprocess_memory = df.memory_usage().sum() / 1024 ** 2 / 1024
            print('After processing, the data occupies {} GB memory.'.format(afterprocess_memory))
            return df

    memory_process = DataPreprocess()
    train.head()

Data preprocessing (converting strings to numbers)

    unique_api = train['api'].unique()

    # Store the API names in dictionaries; indices start at 1
    api2index = {item: (i + 1) for i, item in enumerate(unique_api)}
    index2api = {(i + 1): item for i, item in enumerate(unique_api)}

    # Add an api_idx column to the training and test sets by mapping each
    # api name through api2index; the column holds the dictionary values
    train['api_idx'] = train['api'].map(api2index)
    test['api_idx'] = test['api'].map(api2index)

    # Get the API index sequence that belongs to each file
    def get_sequence(df, period_idx):
        seq_list = []
        # period_idx is train_period_idx or test_period_idx (see below):
        # the row index at which each file_id first appears.
        # period_idx[:-1] is every start position except the last one.
        for _id, begin in enumerate(period_idx[:-1]):
            # df is train or test.
            # df.iloc[begin:period_idx[_id + 1]] selects all rows of one file_id;
            # ['api_idx'] then takes that file's api_idx values.
            seq_list.append(df.iloc[begin:period_idx[_id + 1]]['api_idx'].values)
        # The loop stops before the last file, so append its rows separately
        seq_list.append(df.iloc[period_idx[-1]:]['api_idx'].values)
        return seq_list

    # Row index of the first occurrence of each file_id
    train_period_idx = train.file_id.drop_duplicates(keep='first').index.values
    test_period_idx = test.file_id.drop_duplicates(keep='first').index.values

    # Keep one row per file: file_id and label, first occurrence only
    train_df = train[['file_id', 'label']].drop_duplicates(keep='first')
    test_df = test[['file_id']].drop_duplicates(keep='first')

    # Attach the per-file api_idx sequences as the seq column
    train_df['seq'] = get_sequence(train, train_period_idx)
    test_df['seq'] = get_sequence(test, test_period_idx)

At this point train_df contains three fields, file_id, label and seq, where each seq value is an array of API indices.
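To make the slicing in get_sequence concrete, here is a minimal sketch on a made-up six-row table. The values in `toy` are invented for illustration and are not competition data. It assumes, as the code above implicitly does, that the DataFrame has the default 0..n-1 index and that all rows of one file_id are contiguous, so index labels can be used as row positions.

```python
import pandas as pd

# Toy stand-in for the competition table (values are made up).
toy = pd.DataFrame({
    "file_id": [1, 1, 1, 2, 2, 3],
    "api_idx": [5, 7, 5, 2, 9, 4],
})

# Row index at which each file_id first appears, i.e. the start of each block.
starts = toy.file_id.drop_duplicates(keep="first").index.values

def get_sequence(df, period_idx):
    seq_list = []
    # Each file spans from its own start up to the next file's start.
    for _id, begin in enumerate(period_idx[:-1]):
        seq_list.append(df.iloc[begin:period_idx[_id + 1]]["api_idx"].values)
    # The last file runs to the end of the table.
    seq_list.append(df.iloc[period_idx[-1]:]["api_idx"].values)
    return seq_list

seqs = [s.tolist() for s in get_sequence(toy, starts)]
print(starts.tolist())  # [0, 3, 5]
print(seqs)             # [[5, 7, 5], [2, 9], [4]]
```

If the table had been filtered or shuffled beforehand, the index labels would no longer match row positions, and a `reset_index(drop=True)` after sorting by file_id would be needed first.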
The displayed train_df table can be read as the data grouped by file_id and label, with each row showing the API index values that belong to that file.

TextCNN network structure

    # Module imports (these paths are the ones used by standalone Keras 2.x)
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.layers import Dense, Input, LSTM, Lambda, Embedding, Dropout, Activation, GRU, Bidirectional
    from keras.layers import Conv1D, Conv2D, MaxPooling2D, GlobalAveragePooling1D, GlobalMaxPooling1D, MaxPooling1D, Flatten
    from keras.layers import CuDNNGRU, CuDNNLSTM, SpatialDropout1D
    from keras.layers.merge import concatenate, Concatenate, Average, Dot, Maximum, Multiply, Subtract, average
    from keras.models import Model
    from keras.optimizers import RMSprop, Adam
    from keras.layers.normalization import BatchNormalization
    from keras.callbacks import EarlyStopping, ModelCheckpoint
    from keras.optimizers import SGD
    from keras import backend as K
    from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation
    from keras.layers import SpatialDropout1D
    from keras.layers.wrappers import Bidirectional

    # TextCNN: embedding, parallel 1D convolutions with different kernel
    # widths, global max pooling, then a small dense classifier
    def TextCNN(max_len, max_cnt, embed_size, num_filters, kernel_size, conv_action, mask_zero):
        _input = Input(shape=(max_len,), dtype='int32')
        _embed = Embedding(max_cnt, embed_size, input_length=max_len, mask_zero=mask_zero)(_input)
        _embed = SpatialDropout1D(0.15)(_embed)
        warppers = []
        # One convolution branch per kernel width
        for _kernel_size in kernel_size:
            conv1d = Conv1D(filters=num_filters, kernel_size=_kernel_size, activation=conv_action)(_embed)
            warppers.append(GlobalMaxPooling1D()(conv1d))
        fc = concatenate(warppers)
        fc = Dropout(0.5)(fc)
        fc = BatchNormalization()(fc)
        fc = Dense(256, activation='relu')(fc)
        fc = Dropout(0.25)(fc)
        fc = BatchNormalization()(fc)
        # Eight output classes
        preds = Dense(8, activation='softmax')(fc)
        model = Model(inputs=_input, outputs=preds)
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    # pd.get_dummies() turns a categorical variable into one binary column
    # per category (one-hot encoding). Here it converts the label column
    # into an 8-column matrix that matches the softmax output.
    train_labels = pd.get_dummies(train_df.label).values

    # pad_sequences makes every sequence the same length. By default it pads
    # with 0 at the beginning of each sequence; with maxlen set, sequences
    # longer than maxlen are truncated (by default also from the beginning).
    train_seq = pad_sequences(train_df.seq.values, maxlen=6000)
    test_seq = pad_sequences(test_df.seq.values, maxlen=6000)
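The padding and one-hot steps above can be illustrated without Keras. The sketch below re-implements the default behaviour of pad_sequences (zeros added at the front, long sequences cut from the front) in plain NumPy. `pad_pre` is a hypothetical helper written for this illustration, not a Keras function, and the toy sequences and labels are made up.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trunc = np.asarray(seq)[-maxlen:]   # drop items from the front if too long
        out[row, -len(trunc):] = trunc      # right-align, so padding sits in front
    return out

padded = pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(padded.tolist())  # [[0, 0, 1, 2], [4, 5, 6, 7]]

# One-hot encoding of integer labels, equivalent in spirit to pd.get_dummies(...).values
labels = np.array([0, 2, 1])
one_hot = np.eye(3, dtype="int32")[labels]
print(one_hot.tolist())  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Because the padding value is 0, index 0 should not stand for a real API. That is presumably why api2index starts counting at 1.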
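TextCNN training and prediction

    # Module imports
    from sklearn.model_selection import StratifiedKFold, KFold
    skf = KFold(n_splits=5, shuffle=True)

    # Parameter values passed to TextCNN
    max_len = 6000
    max_cnt = 295
    embed_size = 256
    num_filters = 64
    kernel_size = [2, 4, 6, 8, 10, 12, 14]
    conv_action = 'relu'
    mask_zero = False
    TRAIN = True

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

    # np.zeros() creates an all-zero ndarray of the given shape. The element
    # type defaults to float; pass dtype to get another type. For a
    # multi-dimensional array the shape is given as a tuple:
    # shape=(len(train_seq), 8) means len(train_seq) rows and 8 columns.
    meta_train = np.zeros(shape=(len(train_seq), 8))
    meta_test = np.zeros(shape=(len(test_seq), 8))
    FLAG = True
    i = 0

    # skf.split() yields the training and validation row indices of each fold
    for tr_ind, te_ind in skf.split(train_labels):
        i += 1
        print('FOLD: {}'.format(i))
        print(len(te_ind), len(tr_ind))
        model_name = 'benchmark_textcnn_fold_' + str(i)
        X_train, X_train_label = train_seq[tr_ind], train_labels[tr_ind]
        X_val, X_val_label = train_seq[te_ind], train_labels[te_ind]

        # TextCNN applies a convolutional neural network to text
        # classification. It borrows convolutions from computer vision but is
        # an NLP model: it works on sequences of token embeddings, here the
        # API call sequence of each file.
        model = TextCNN(max_len, max_cnt, embed_size, num_filters, kernel_size, conv_action, mask_zero)

        model_save_path = './NN/%s_%s.hdf5' % (model_name, embed_size)
        early_stopping = EarlyStopping(monitor='val_loss', patience=3)
        model_checkpoint = ModelCheckpoint(model_save_path, save_best_only=True, save_weights_only=True)
        if TRAIN and FLAG:
            # fit() trains the model on the training fold and evaluates it on
            # the validation fold after every epoch. Early stopping ends
            # training once val_loss has not improved for 3 epochs, and the
            # checkpoint keeps the best weights seen so far.
            model.fit(X_train, X_train_label,
                      validation_data=(X_val, X_val_label),
                      epochs=100, batch_size=64, shuffle=True,
                      callbacks=[early_stopping, model_checkpoint])

        # load_weights() restores the best saved weights into the model
        model.load_weights(model_save_path)

        # predict() runs the trained model on the inputs and returns, for
        # each sample, the 8 class probabilities from the softmax layer
        pred_val = model.predict(X_val, batch_size=128, verbose=1)
        pred_test = model.predict(test_seq, batch_size=128, verbose=1)

        # Store the validation-fold predictions in meta_train
        meta_train[te_ind] = pred_val
        # Accumulate this fold's test-set predictions in meta_test
        meta_test += pred_test

        # K.clear_session() destroys the current TF graph and creates a new one
        K.clear_session()

    # Average the test predictions over the 5 folds
    meta_test /= 5.0

Submitting the results

    test_df['prob0'] = 0
    test_df['prob1'] = 0
    test_df['prob2'] = 0
    test_df['prob3'] = 0
    test_df['prob4'] = 0
    test_df['prob5'] = 0
    test_df['prob6'] = 0
    test_df['prob7'] = 0
    test_df[['prob0', 'prob1', 'prob2', 'prob3', 'prob4', 'prob5', 'prob6', 'prob7']] = meta_test
    test_df[['file_id', 'prob0', 'prob1', 'prob2', 'prob3', 'prob4', 'prob5', 'prob6', 'prob7']].to_csv('nn_baseline_5fold.csv', index=None)

Reference blog post: "Alibaba Cloud Tianchi competition problems (machine learning): Alibaba Cloud Security malware detection (complete code)", 全栈OJay's blog on CSDN.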
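The bookkeeping in the fold loop of the training section (filling meta_train with out-of-fold predictions and averaging meta_test over the folds) can be sketched in isolation. This is an illustrative NumPy-only version, not the author's training code. `fit_predict` is a hypothetical stand-in for "train a model, then predict" that simply outputs the class frequencies of its training fold, the labels are random, and the folds are built with `np.array_split` instead of scikit-learn's KFold.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_classes, n_folds = 20, 6, 8, 5

# Made-up integer labels, one-hot encoded like train_labels in the article.
labels = rng.integers(0, n_classes, size=n_train)
one_hot = np.eye(n_classes)[labels]

def fit_predict(train_targets, n_rows):
    """Hypothetical stand-in model: predicts the training fold's class frequencies."""
    probs = train_targets.mean(axis=0)
    return np.tile(probs, (n_rows, 1))

meta_train = np.zeros((n_train, n_classes))
meta_test = np.zeros((n_test, n_classes))

# Shuffle the row indices and cut them into n_folds validation folds.
folds = np.array_split(rng.permutation(n_train), n_folds)
for k, val_idx in enumerate(folds):
    tr_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    # Each training row is predicted exactly once, by the model that did not see it.
    meta_train[val_idx] = fit_predict(one_hot[tr_idx], len(val_idx))
    # Every fold's model predicts the whole test set; the results are summed.
    meta_test += fit_predict(one_hot[tr_idx], n_test)

# Averaging the summed probabilities keeps each row a valid distribution.
meta_test /= n_folds
```

meta_train is not used for the submission itself, but out-of-fold predictions like these are commonly kept as input features for a second-stage model.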