Welcome and questions on Stage One
a_silmariel
silmariel at telefonica.net
Mon Feb 23 22:39:52 UTC 2004
If you receive this twice, you know who to blame and it's not Barry
this time.
I'll try to clarify some things. It is not that we have to choose one
approach or the other. We'll use Excel first because it's easier, but any
work done will be put into a MySQL database as soon as it's finished.
And of course, Barry, if you prefer to work directly with the group
database, do so; it provides a nice 'Export Table' function.
No, Excel is not a database; as Barry has pointed out, it is nowhere near
as effective at processing data. After all, that's what databases are
designed for.
What I really want is for no one to have to do the same work twice.
STAGE ONE:
We have the first set of information, which from my PoV is the worst part
of the work, because it means reading, rejecting and categorising 100,000
posts that are kept prisoner online by yahoomort.
What information do we have in the spreadsheet?
1 -- Post information:
The raw material extracted from the message index, copied and pasted
into a
spreadsheet.
2 -- Cataloguing information:
Codes for rejection and up to six categories. Six categories is an
arbitrary decision. Just using the number of the category is enough; the
name can be added or not, but it is not necessary.
3 -- Control information:
It is not in the spreadsheet, but I strongly suggest that we include
another code for the reviewer, right from the start (see the example
layout just below).
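As an illustration only (the exact index columns are whatever Yahoo gives
us, and all the names here are mine, not a decision), a catalogued row in
the sheet could end up looking like this:

    post_number, date, author, subject, rejected, cat1, cat2, cat3, cat4, cat5, cat6, reviewer

with the category cells left blank when fewer than six apply.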
> - what is the best size of spreadsheet to work with ? (Units of
> 500, 1000 posts or whatever ? Threads are no respecter of arbitrary
> boundaries.. it has to be easy to follow on from one sheet to the
> next when tracing an argument)
The maximum manageable file size varies; some programs start having
problems with very large documents. From my side it makes no difference:
each spreadsheet can contain a different number of posts.
I say it makes no difference because MySQL reliability tests are run
against millions of records, so from the technical side all the posts
could go into one spreadsheet.
> - where should the Excel sheets be based - on our individual
> computers, or on a server somewhere ? They certainly need to end up
> in one central place for obvious reasons.
Now let's save that Excel file (or any spreadsheet) in .csv format. You
can make it as long as you want; the only restriction is one sheet per
spreadsheet. Then send me the .csv document.
I'll put that information into a database, retrieve it in any number of
ways, and produce .csv files that can be opened by spreadsheets (a sketch
of that round trip follows below).
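Here is a rough sketch of that round trip in Python, assuming the csv and
MySQLdb modules and a table called posts; host, user, password, file and
table names are all placeholders, not the final setup:

    # Sketch only: load one reviewer's .csv sheet into a MySQL table,
    # then pull it back out as a fresh .csv. All names are placeholders.
    import csv
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="catalogue",
                           passwd="secret", db="catalogue")
    cur = conn.cursor()

    # Insert every row of the sheet, however many columns it has.
    for row in csv.reader(open("sheet_0001.csv")):
        if not row:
            continue
        placeholders = ", ".join(["%s"] * len(row))
        cur.execute("INSERT INTO posts VALUES (" + placeholders + ")", row)
    conn.commit()

    # Retrieve it in whatever order we like and write it back out as .csv.
    cur.execute("SELECT * FROM posts ORDER BY 1")
    out = csv.writer(open("export.csv", "w"))
    for row in cur.fetchall():
        out.writerow(row)

The point is only that a .csv sheet and a database table are trivially
interchangeable, so work done in one form is never lost to the other.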
I'll have a database, and I can open it up for remote access for those
brave users who want to work from a text-only remote console.
STAGE TWO:
Making it possible on-line.
That's what I should have been doing this weekend: planning this in more
detail.
One option is to access the database remotely via a client, but I have
not yet looked for nice graphical utilities that may already exist.
The other is to access it through forms on a web page.
> [ ].....database and the website we use to access it, to enable
> us to not only work simultaneously on it, but also to overcome any
> computer incompatibility problems that might crop up, especially as
> more people work on the project as we go on. My questions relating to
> this solution are:
> - Can the archive posts just be dumped into such a database,
> and would you do it all at once, or in orderly chunks of eg 1000 at a
> time?
> - Where would the database be located (what server, as before
> with the Excel sheets)
> - How long would it take?
The database will already be set up. Dump it entirely, send it to another
computer, start a MySQL database (it runs on Linux and Windows and is free
to use, no licences to pay), grab the dump you made and insert it into the
database, roughly as sketched below. This can be done in a day.
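For what it's worth, the move itself is the standard MySQL dump-and-restore
routine; here is a sketch driven from Python, where the database name, user
and dump file name are all made up:

    # Sketch of the move, assuming MySQL is installed on both machines.
    import subprocess

    DUMP = "catalogue_dump.sql"

    # On the machine that has the data: dump the whole database to one file.
    subprocess.call("mysqldump -u catalogue -p catalogue_db > " + DUMP,
                    shell=True)

    # On the receiving machine: create an empty database and load the dump.
    subprocess.call("mysql -u catalogue -p -e 'CREATE DATABASE catalogue_db'",
                    shell=True)
    subprocess.call("mysql -u catalogue -p catalogue_db < " + DUMP,
                    shell=True)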
> - Which website would we use to access it and would there be
> any major problems in multiple people accessing it any one time?
I think limiting connections to about 30 users at a time is the standard
measure small servers take to avoid DoS attacks, but my plans include a
very 'light' (in terms of the amount of data to be transmitted) web
interface, so we should not be a problem for the HTTP server.
The database can be on another computer or on the same one.
Basically we need a fixed IP address, a 256 kbps 24h/day connection and a
computer. With that and Apache we are independent. Apache is an HTTP
server that runs on both Windows and Linux and is also free, and I can set
it up or get someone to do it. A very small sketch of the kind of form
interface I have in mind follows.
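To give an idea of how small that interface can be, here is one CGI script
behind Apache, in Python; the table, columns and connection details are
placeholders, not a design decision:

    #!/usr/bin/env python
    # Sketch of the 'light' web interface: one CGI script that shows a post
    # and offers a box for category codes. Names are placeholders.
    import cgi
    import MySQLdb

    form = cgi.FieldStorage()
    post_number = form.getfirst("post", "1")

    conn = MySQLdb.connect(host="localhost", user="catalogue",
                           passwd="secret", db="catalogue")
    cur = conn.cursor()
    cur.execute("SELECT subject, author FROM posts WHERE post_number = %s",
                (post_number,))
    row = cur.fetchone()

    print("Content-Type: text/html\n")
    print("<html><body>")
    if row:
        print("<p>Post %s: %s (%s)</p>" % (post_number, row[0], row[1]))
        print('<form method="post">')
        print('Categories: <input type="text" name="categories">')
        print('<input type="submit" value="Save"></form>')
    print("</body></html>")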
Barry, which Java version does your Mac run?
STAGE THREE:
Absorbing the full posts into the database, not only the index information.
Honestly, I have no idea how to do that from Yahoo (I have an idea, but it
is surely illegal). I'd be glad to hear what Paul has to say.
Then insert that new information into the database.
Also, absorbing the previous groups into the database. This will create a
conflict with post numbers that will somehow be solved; one possible fix
is sketched below.
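One common way to handle that conflict (just a thought, nothing decided)
is to key every post by group and number together, so the old groups'
numbering can never collide with the current group's:

    # Sketch only: a table keyed by (group, post number). Table and column
    # names are placeholders, not a final schema.
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="catalogue",
                           passwd="secret", db="catalogue")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS posts_all_groups (
            group_name  VARCHAR(64),
            post_number INT,
            subject     VARCHAR(255),
            PRIMARY KEY (group_name, post_number)
        )
    """)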
I'm tired, it is late, but I thought I had to try to explain some things.
The key is that no matter how you do it, your work won't be lost: either
through the group database (I will iron my hands a little for not having
proposed it myself) or through a spreadsheet.
I know all of you have questions. Shoot.
Carolina