Nov 2, 2016

iKnow - Text categorization "Category 1 covers the whole dataset"


I am using iKnow text categorization to classify texts. My training set consists of 11 medical articles. Here is part of the source code:

  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  SET dirpath = "D:\iKnowTestCase\SmallDataBase\Medical"
  SET stat = myloader.SetLister(flister)
  SET stat = myloader.ProcessList(dirpath,$LB("txt"),0,"")
  IF stat '= 1 {WRITE "The lister failed: " DO $System.Status.DisplayError(stat) QUIT }
  SET tTrainingSet = ##class(%iKnow.Filters.RandomFilter).%New(domId, 0.7)
  SET tTestSet = ##class(%iKnow.Filters.GroupFilter).%New(domId, "AND", 1) // NOT filter
  DO tTestSet.AddSubFilter(tTrainingSet)
  SET numSrcFD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,tTrainingSet)
  WRITE "The training set includes ",numSrcFD," sources",!
  SET numSrcTD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,tTestSet)
  WRITE "The test set includes ",numSrcTD," sources",!
  SET tBuilder = ##class(%iKnow.Classification.IKnowBuilder).%New("MIT",tTrainingSet)
  SET tBuilder.ClassificationMethod = "naiveBayes"
  SET mstat = tBuilder.%AddCategory("Medical","",5)
  SET stat = tBuilder.%GetCategoryInfo(.pcategories)
  WRITE $LISTTOSTRING(pcategories(1)),!
  SET stat = tBuilder.%CreateClassifierClass("MIT.Classifier",1,1,1,1)
  WRITE "In the Optimizer",!
  SET tOpt = ##class(%iKnow.Classification.Optimizer).%New(domId,tBuilder)
  WRITE "optimizer is ",tOpt,!
  SET tOpt.ScoreMetric="MicroPrecision"
  WRITE "ScoreMetric is ",tOpt.ScoreMetric,!
  DO ##class(%iKnow.Queries.EntityAPI).GetTop(.result,domId,1,50)
  WRITE "load terms ",tOpt.LoadTermsArray(.result),!
  SET optstat=tOpt.Optimize(5)
  WRITE "Optimize status: ",optstat,!
  WRITE "End of Optimizer",!!
  DO ##class(%iKnow.Classification.Utils).%RunModelFromDomain(.r,
  SET i=1
  WHILE $DATA(r(i)) {
       WRITE $LISTTOSTRING(r(i),",")
       SET i=i+1 }
  WRITE tBuilder.%TestClassifier(tTestSet,.testresult,.accuracy),!
  WRITE "model accuracy: ",$FNUMBER(accuracy*100,"L",2)," percent",!
  SET n=1
  SET wrongcnt=0
  WHILE $DATA(testresult(n)) {
    IF $LISTGET(testresult(n),2) '= $LISTGET(testresult(n),3) {
     SET wrongcnt=wrongcnt+1
     WRITE "WRONG: ",$LISTGET(testresult(n),1)
     WRITE " actual ",$LISTGET(testresult(n),2)
     WRITE " pred. ",$LISTGET(testresult(n),3),! }
  SET n=n+1 }
  WRITE wrongcnt," out of ",n-1,!

When I call tBuilder.%CreateClassifierClass("MIT.Classifier",1,1,1,1), I get the error "category 1 covers the whole data". I cannot find this error in the documentation.

Thank you.


Discussion (2)

When you use Text Categorization, you need a piece of metadata that groups the texts into different categories, such as "Gender", "Month", or "Diagnosis Code". Each record must have one of these values associated with it, so the learning process can determine which concepts and terms go with which category.

You will get the "category 1 covers the whole data" error if the metadata field is either not defined or not correctly populated. Without some variability in the metadata field that identifies the category, the machine learning has no reference point for sorting your text records.

Make sure that you have a metadata field defined and that the records you use to build the categories carry differing values in that field.
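
As a rough illustration of what "defined and populated" could look like in code, here is a minimal sketch. It assumes the domain ID domId from the question; the field name "Category", the value "Medical", and the variable extId (standing in for the external ID of one loaded source) are hypothetical placeholders, and the calls should be checked against the %iKnow.Queries.MetadataAPI class reference for your version.

```objectscript
  // Assumption: "Category" is a hypothetical metadata field name.
  // Define the field on the domain, supporting the "=" operator.
  DO ##class(%iKnow.Queries.MetadataAPI).AddField(domId,"Category",$LB("="))

  // Assumption: extId holds the external ID of one source already loaded
  // into the domain, and "Medical" is the category that source belongs to.
  SET stat = ##class(%iKnow.Queries.MetadataAPI).SetValue(domId,"Category",extId,"Medical")
  IF $System.Status.IsError(stat) { DO $System.Status.DisplayError(stat) }
```

Every source in the training set needs a value, and at least two different values have to occur across the set, otherwise a single category again covers everything.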

What is the attribute of your texts that you want to categorize? When building TC models on top of iKnow domains, as Chip indicated, the easiest way is to base the categories on a metadata field, using the %LoadMetadataCategories method on your builder object. The more manual route is the %AddCategory method you are using, but it requires a filter specification as the second argument (you passed "") to identify which records belong to the category you are adding. That is what makes it a training set: a set of texts for which the outcome (category) is known, so the TC infrastructure can try to find correlations to exploit.
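
The two routes could be sketched roughly as follows. This reuses domId and tTrainingSet from the question and assumes a hypothetical metadata field named "Category" that is populated for every source; treat it as a starting point under those assumptions rather than a tested implementation, and use one route or the other, not both.

```objectscript
  SET tBuilder = ##class(%iKnow.Classification.IKnowBuilder).%New("MIT",tTrainingSet)
  SET tBuilder.ClassificationMethod = "naiveBayes"

  // Route 1: derive one category per distinct value of the metadata field.
  // "Category" is a hypothetical field name, not one from the question.
  SET stat = tBuilder.%LoadMetadataCategories("Category")
  IF $System.Status.IsError(stat) { DO $System.Status.DisplayError(stat) QUIT }

  // Route 2 (instead of route 1): add each category by hand, passing a
  // filter that selects the sources belonging to it as the second argument.
  // SET tMedical = ##class(%iKnow.Filters.SimpleMetadataFilter).%New(domId,"Category","=","Medical")
  // SET stat = tBuilder.%AddCategory("Medical",tMedical)
```

With either route there have to be at least two categories, each covering only part of the training set, before %CreateClassifierClass has anything to distinguish.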

Separately, 11 records is a very small training set. I would not expect to find much statistically relevant information to exploit. Even when you're building a rule-based model manually, it'll be barely enough to validate your model. So probably it's worth trying to get hold of (much) more data to start with.

I've uploaded a tutorial we prepared for a Global Summit academy session on the subject in a separate thread (could not attach it to an existing one). That may give you a better idea of what the infrastructure can be used for and also takes you through the GUI, which may be more practical than the code-based approach you're referring to.