[Textop-en-phil] Outlining vs. tagging and indexing
Larry Sanger
larry.sanger at dufoundation.org
Tue Aug 8 17:19:54 PDT 2006
(Was RE: [Textop-en-phil] [Textop] FAQ posted)
Since this came up in Clea's mail, and since I know it's an important
topic, I thought I would discuss it next.
> As far as I understand you ARE asking experts to tag
> information just in an inefficient manner - not to imply
> that's necessarily a bad thing. The tags you are asking them
> to give are: a fine-grained categorization, a linguistic
> function, a summary, and the source information - all meta
> data and each can be represented as a tag.
>
> I understand why you need to have an agreed set of tags for
> the categorization, but the highly touted ESP game results in
> an agreed set of tags for images http://www.espgame.org/. So
> that can't be the biggest barrier.
There is a very definite and straightforward, although not completely
simple, answer to this. To oversimplify: extensibility is *the* reason
that outlining is necessary for the Collation Project (CP).
The basic aim of the CP is to group similar *chunks* (as I've defined
them) together. Furthermore, as the number of texts in the database
grows, the similarity of any two chunks in the database grows: we do not
want the number of texts under a given tag or index word to become so
great that the user is presented with an undifferentiated mass of
chunks. ("Undifferentiated mass of chunks"--this is making me
nauseous!) If we make categories that have more than a dozen or so
entries, we wind up ignoring the very purpose of grouping chunks in the
first place.
Furthermore, for the same reason--we want to make it as easy as possible
for people to find chunks that concern the precise same topics--we
cannot use categories that overlap too much. Perhaps we can tolerate
*some* overlap, but we cannot tolerate much. Again, consider what
happens when we place thousands of books (and millions of chunks) into
the system. Unless there is a way of delineating categories so that
they are for the most part mutually exclusive, we will end up with a
del.icio.us (but not very tasty) "tag soup," in which tags, or index
items, are constantly added to the mix, and previously-collated works
constantly need to be revisited in order to make sure they are
completely tagged.
So there are two sorts of constraints here: vertical constraints, which
require us to create more specialized, "tailor-made" categories in order
to accommodate growing numbers of chunks; and horizontal constraints,
which require us to make sure there are reasonably clear "boundaries"
between categories.
*The* way to meet both constraints is to create an outline of
increasingly specialized information. An outline, in which categories
are sorted hierarchically according to definite relations (eventually!),
at once allows for easy "vertical" expansion as necessary, into more
specialized topics, and allows for a sort of "conceptual bookkeeping"
necessary for making sure there is no unnecessary duplication or overlap
between categories.
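
To make the two constraints concrete, here is a minimal sketch in Python.
It is only an illustration, not the Textop software: the Node class, the
limit of 12 chunks (one reading of "a dozen or so"), and both helper
functions are assumptions made for this example. The first helper flags
headings that have grown too large and need more specialized subheadings
(the vertical constraint); the second flags chunks filed under more than
one heading (the horizontal constraint).

```python
from dataclasses import dataclass, field

# Assumed threshold, taken loosely from "a dozen or so" entries per category.
MAX_CHUNKS = 12


@dataclass
class Node:
    """One heading in a hypothetical outline, with the chunks filed under it."""
    heading: str
    chunks: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def add_child(self, heading):
        """Create a more specialized subheading under this one."""
        child = Node(heading)
        self.children.append(child)
        return child

    def walk(self):
        """Yield this node and every node below it, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def overfull(root, limit=MAX_CHUNKS):
    """Vertical check: headings holding more chunks than the limit."""
    return [n.heading for n in root.walk() if len(n.chunks) > limit]


def multiply_filed(root):
    """Horizontal check: chunks that appear under more than one heading."""
    seen = {}
    for node in root.walk():
        for chunk in node.chunks:
            seen.setdefault(chunk, []).append(node.heading)
    return {c: hs for c, hs in seen.items() if len(hs) > 1}
```

A heading reported by overfull() is a candidate for splitting into
subheadings; a chunk reported by multiply_filed() marks a pair of headings
whose boundary may need to be redrawn.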
Basically, tagging systems and index systems both must behave *like* an
outline if they are to satisfy the constraints.
For instance, to keep new users from creating new tags that are virtually
synonymous with old tags (or that unnecessarily overlap them), users must be
presented with a way to search through the old tags. A mere search
engine will not do the trick, because people may not know what to search
for. So there need to be groupings, or some other device that allows
people to discover what similar tags already exist. An outline is far
and away the most efficient way to achieve this purpose.
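
As a rough illustration of discovery by browsing rather than by search,
the following Python sketch stores an outline as nested dictionaries and
lists what already exists under a given branch. The outline contents and
the function names are hypothetical, invented for this example only.

```python
# Hypothetical outline: each heading maps to a dict of its subheadings.
EXAMPLE_OUTLINE = {
    "Politics": {"Sovereignty": {}, "Liberty": {}},
    "Psychology": {"Sense": {}, "Imagination": {}},
}


def render(outline, depth=0):
    """Return the outline as indented lines, one heading per line."""
    lines = []
    for heading, subheadings in outline.items():
        lines.append("  " * depth + heading)
        lines.extend(render(subheadings, depth + 1))
    return lines


def existing_under(outline, path):
    """List the headings already present below the branch named by path."""
    node = outline
    for heading in path:
        node = node[heading]
    return list(node)
```

Someone about to add a heading under "Politics" would first see
existing_under(EXAMPLE_OUTLINE, ["Politics"]), which lists "Sovereignty"
and "Liberty", without needing to know what to search for.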
The appeal of both tagging and indexing is clearly the *apparent*
efficiency of the process. You simply slap a tag on a widget, and
enough people do that with enough widgets, and interesting patterns
emerge. I don't deny that. What I do deny is that, simply because
interesting patterns emerge from collective tagging, the two constraints
above can be met via collective tagging. It's virtually impossible for
the usual sort of tagging system, anyway, to satisfy the constraints.
> The question I've had is why you chose to break apart an
> existing text and put it into an outline, as opposed to
> marking up the text with tags which is similar to the
> low-tech method I used in my textbooks.
First, notice on the screenshot (http://www.textop.org/screenshot.html)
that I envision the software working so that it's nearly as simple as
tagging. You select a bit of the text, write a function and summary,
and then drag the resulting chunk into the (editable) outline.
The reason that we need to break up the text and group chunks under
headings, as you can see in the second ("Current Node") column of the
screenshot, is that there is a mutually dependent relationship between
headings and nodes filed under the headings. Particularly if we must
constantly edit the outline, we will have to revisit the chunks that are
filed in the parts of the outline we are editing (I've already had to do
this with the Leviathan outline). Collecting the chunks together makes
this easy.
Furthermore, collecting the chunks together provides an excellent
explanation of what an outliner means by a given heading. If we merely
slap tags onto chunks, even if we must pick from a list, and we do not
compare the chunks to the other chunks filed under the heading, we can
easily misfile the chunk. I am constantly looking at what I have filed
where to determine where a new chunk should be filed.
> The conclusion I came to was because of the benefits of the
> unique view, allowing people to make new connections within
> and between domains. And that collaborative tagging
> technology just isn't there yet for the purposes of text-op.
That's right (if I understand you correctly). A unitary outline can
bring together people of differing viewpoints, disciplines, and
languages in a way that tagging and indexing cannot do so easily.
> My husband is working on his masters thesis in a related area
> and we've brainstormed on many parts, and found that there
> are really no tools we know of that allow you, even
> personally, to tag text at the paragraph or sentence
> level...let alone have many people collaborate on it. I
> think the closest we get to this is something like
> del.icio.us http://del.icio.us/.
I've looked too, and I agree, there just isn't anything.
> I think you could accomplish the goals of text-op through a
> smartly designed tagging system, however the current
> technology fails to meet the criteria of collaborative,
> limited categorization and sentence level granularity, thus
> another method, at least for the pilot project, needed to be found.
Well, if you (and/or your husband) could explain *how* a tagging system
could be "smartly designed" so as to satisfy the above constraints, I'd
be very interested. Or if you think my constraints are not precisely
what they should be, and that I should have slightly different
constraints that you *can* satisfy, I'd be interested in that, too.
> In my experience wikis are best used for rapid co-creation of
> content, but I don't view text-op as a content creation
> problem, rather as a meta-data problem (maybe this is wrong,
> but in my head it seems a better fit) - which seems to be the
> bottleneck in loads of web-based research at the moment.
> Hopefully some better tools for this type of problem will
> arise in the next few years - maybe even out of my husband's
> research lab or one of my many pet projects.
We're using a wiki ONLY for the pilot project, simply because we
presently lack the tool we need. You can read about the software I
think we need here: http://www.textop.org/reqs_v1.html
The Collation Project involves many challenges: some of them are
definitely content creation problems (how do you motivate a lot of
people to read carefully and summarize texts according to a very
specific standard?) and some of them are definitely metadata problems
(what metadata should be included?).
> So that was my internal argument, and I thought you might
> garner something from that, even if it is just how off base I
> am. As I found out if you have to try to convince computer
> science people of the validity of this project...tagging will
> be one of the major hangups.
I agree, it will be difficult. Eventually, though, I think people will
understand. As in the case of Wikipedia, it is the vivid example that
teaches the concepts to those who don't get them right away. Basically,
people analyzing "what makes Wikipedia work" are reverse-engineering the
actual thought that we put into deciding on and implementing the
policies of Wikipedia. ;-)
--Larry